Abstract: We present a development of parts of rate-distortion theory and pattern-matching
algorithms for lossy data compression, centered around a lossy version of the Asymptotic Equipartition
Property (AEP). This treatment closely parallels the corresponding development in lossless compression,
a point of view that was advanced in an important paper of Wyner and Ziv in 1989. In the
lossless case we review how the AEP underlies the analysis of the Lempel-Ziv algorithm by viewing it
as a random code and reducing it to the idealized Shannon code. This also provides information about
the redundancy of the Lempel-Ziv algorithm and about the asymptotic behavior of several relevant
quantities.

In the lossy case we give various versions of the statement of the generalized AEP and we outline
the general methodology of its proof via large deviations. Its relationship with Barron and Orey's
generalized AEP is also discussed. The lossy AEP is applied to: (i) prove strengthened versions
of Shannon's direct source coding theorem and universal coding theorems; (ii) characterize the
performance of "mismatched" codebooks in lossy data compression; (iii) analyze the performance of
pattern-matching algorithms for lossy compression (including Lempel-Ziv schemes); (iv) determine
the first order asymptotics of waiting times (with distortion) between stationary processes; (v) characterize
the best achievable rate of "weighted" codebooks as an optimal sphere-covering exponent. We
then present a refinement to the lossy AEP and use it to: (i) prove second order (direct and converse)
lossy source coding theorems, including universal coding theorems; (ii) characterize which sources
are quantitatively easier to compress; (iii) determine the second order asymptotics of waiting times
between stationary processes; (iv) determine the precise asymptotic behavior of longest match-lengths
between stationary processes. Extensions to random fields are also given.

Index Terms: Rate-distortion theory, pattern-matching, large deviations, data compression.

1 Introduction

1.1 Lossless Data Compression 

It is probably only a slight exaggeration to say that the central piece of mathematics in the proof of 
almost any lossless coding theorem is provided by the Asymptotic Equipartition Property, or AEP. 
Suppose we want to (losslessly) compress a message $X_1^n = (X_1, X_2, \ldots, X_n)$ generated by a stationary
memoryless source $X = \{X_n ; n \geq 1\}$ where each $X_i$ takes values in the finite alphabet $A$ (much
more general situations will be considered later). For this source, the AEP states that as $n \to \infty$

$$-\frac{1}{n} \log_2 P^n(X_1^n) \to H \quad \text{in probability} \tag{1}$$

where $P$ is the common distribution of the independent and identically distributed (i.i.d.) random
variables $X_i$, $P^n$ denotes the (product) joint distribution of $X_1^n$, and $H = E[-\log_2 P(X_1)]$ is the
entropy rate of the source; see Shannon's original paper [68, Theorem 3] or Cover and Thomas' text
[20, Chapter 4]. [Here and throughout the paper, $\log_2$ denotes the logarithm taken to base 2, and
$\log$ denotes the natural logarithm.] From (1) we can immediately extract some useful information:
It implies that, for any $\epsilon > 0$, when $n$ is large the message $X_1^n$ will most likely have probability
at least as high as $2^{-n(H+\epsilon)}$:

$$P^n(X_1^n) \geq 2^{-n(H+\epsilon)} \quad \text{with high probability.} \tag{2}$$

But there cannot be many high-probability messages. In fact, there can be at most $2^{n(H+\epsilon)}$ messages
with $P^n(x_1^n) \geq 2^{-n(H+\epsilon)}$, so we need approximately $2^{nH}$ representative messages from the source $X$
in order to cover our bets (with high probability). If we let $\mathcal{T}_n$ be the set of high-probability strings
$x_1^n \in A^n$ having $P^n(x_1^n) \geq 2^{-n(H+\epsilon)}$, then with high probability we can correctly represent the source
output $X_1^n$ by an element of $\mathcal{T}_n$. Since there are no more than $2^{n(H+\epsilon)}$ of them, we need approximately
$nH$ bits (more precisely, no more than $\lceil n(H+\epsilon) \rceil$ bits) to correctly encode $X_1^n$.
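
The convergence in (1) is easy to observe numerically. The following is a minimal sketch, not part of the original development: the two-letter distribution `p`, the sample size, and the helper names are illustrative assumptions. It draws one long i.i.d. message and compares $-(1/n)\log_2 P^n(X_1^n)$ with the entropy $H$.

```python
import math
import random

def entropy_bits(p):
    """Entropy H = E[-log2 P(X_1)] of a distribution given as {symbol: probability}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def normalized_log_prob(message, p):
    """Return -(1/n) log2 P^n(x_1^n) for an i.i.d. source with marginal p."""
    return -sum(math.log2(p[x]) for x in message) / len(message)

def sample_message(p, n, rng):
    """Draw n i.i.d. symbols from p using the supplied random generator."""
    symbols = list(p)
    weights = [p[s] for s in symbols]
    return rng.choices(symbols, weights=weights, k=n)

p = {"a": 0.3, "b": 0.7}      # illustrative source distribution (an assumption)
rng = random.Random(0)        # fixed seed so that the run is repeatable
x = sample_message(p, 20000, rng)
print(entropy_bits(p), normalized_log_prob(x, p))
```

For large $n$ the two printed numbers should be close, which is exactly the statement that a typical message has probability about $2^{-nH}$.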

Shannon's Random Code. Another way to extract information from (1) is as follows. The fact
that for large $n$ we typically have $P^n(X_1^n) \approx 2^{-nH}$ also means that if we independently generate
another random string, say $Y_1^n$, from the same distribution as the source, the probability that $Y_1^n$
is the same as $X_1^n$ is about $2^{-nH}$. Suppose that instead of using the strings in $\mathcal{T}_n$ above as our
representatives for the source, we decided to independently generate a collection of random strings $Y_1^n$
from the distribution $P^n$; how many would we need? Given a source string $X_1^n$, the probability that
any one of the $Y_1^n$ matches it is $\approx 2^{-nH}$, so in order to have high probability of success in representing
$X_1^n$ without error we should choose approximately $2^{n(H+\epsilon)}$ random strings $Y_1^n$. Therefore, whether
we choose the set of representatives systematically or randomly, we always need about $2^{nH}$ strings in
order to be able to encode $X_1^n$ losslessly with high probability. Note that the randomly generated set
$T_n$ is nothing but Shannon's random codebook [69] specialized to the case of lossless compression.



Idealized Lempel-Ziv Coding. In 1989, in a very influential paper [75], Wyner and Ziv took
the above argument several steps further. Aiming to "obtain insight into the workings of [...] the
Lempel-Ziv data compression algorithm," they considered the following coding scenario: Suppose
that an encoder and a decoder both have available to them a long database, say an infinitely long
string $Y_1^\infty = (Y_1, Y_2, \ldots)$ that is independently generated from the same distribution as the source.
Given a source string $X_1^n$ to be transmitted, the encoder looks for the first appearance of $X_1^n$ in the
database (assuming, for now, that it does appear somewhere). Let $W$ denote the position of this first
appearance, that is, let $W$ be the smallest integer for which $Y_W^{W+n-1} = (Y_W, Y_{W+1}, \ldots, Y_{W+n-1})$ is
equal to $X_1^n$. Then all the encoder has to do is to tell the decoder the value of $W$; the decoder
can read off the string $Y_W^{W+n-1}$ and recover $X_1^n$ perfectly. This description can be given using (cf.
(HlQ) no more than

$$\ell(X_1^n) = \log_2 W + O(\log_2 \log_2 W) \quad \text{bits.} \tag{3}$$

How good is this scheme? First note that, for any given source string $x_1^n$, the random variable
$W$ records the first "success" in a sequence of trials ("Is $Y_1^n = x_1^n$?" "Is $Y_2^{n+1} = x_1^n$?" and so on),
each of which has probability of success $p = P^n(x_1^n)$. Although these trials are not independent, for
large $n$ they are almost independent (in a sense that will be made precise below), so the distribution
of $W$ is close to a geometric with parameter $p = P^n(x_1^n)$. For long strings $x_1^n$ (i.e., for large $n$) $p$ is
small, and $W$ is typically close to its expected value, which is approximately equal to the mean of a
geometric random variable with parameter $p$, namely $1/p$. But the AEP tells us that, when $n$ is large,
$p = P^n(X_1^n) \approx 2^{-nH}$, so we expect $W$ to be typically around $2^{nH}$. Hence, from (3) the description
length $\ell(X_1^n)$ of $X_1^n$ will be, to first order,

$$\ell(X_1^n) \approx -\log_2 P^n(X_1^n) \approx nH \quad \text{bits, with high probability.}$$

This shows that the above scheme is asymptotically optimal, in that its limiting compression ratio is equal
to the entropy.
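
The idealized scheme can be sketched in a few lines. This is an illustration under stated assumptions rather than the construction analyzed in the text: the database is a short finite list standing in for the infinite string, the function names are hypothetical, and the Elias delta code is used only as one familiar way of describing an integer $W$ with $\log_2 W + O(\log_2 \log_2 W)$ bits.

```python
def first_match_position(database, x):
    """Return W, the 1-based index of the first occurrence of x in database, or None."""
    n = len(x)
    for start in range(len(database) - n + 1):
        if database[start:start + n] == x:
            return start + 1
    return None

def decode(database, w, n):
    """Recover the n-symbol string that starts at 1-based position w."""
    return database[w - 1:w - 1 + n]

def elias_delta_length(w):
    """Number of bits used by the Elias delta code for the integer w >= 1."""
    k = w.bit_length() - 1                       # floor(log2 w)
    return k + 2 * ((k + 1).bit_length() - 1) + 1

database = [0, 1, 1, 0, 1, 0, 0]   # toy finite stand-in for the infinite database
x = [1, 0, 1]
w = first_match_position(database, x)
print(w, decode(database, w, len(x)), elias_delta_length(w))
```

The decoder needs only `w` and the shared database, which is why the description length is governed by the size of the waiting time $W$.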



Practical Lempel-Ziv Coding. The Lempel-Ziv algorithm [86][87] and its many variants (see, e.g.,
[Q, Ch. 8]) are some of the most successful data compression algorithms used in practice. Roughly
speaking, the main idea behind these algorithms is to use the message's own past as a database for
future encoding. Instead of looking for the first match in an infinitely long database, in practice the
encoder looks for the longest match in a database of fixed length. The analysis in [75] of the idealized
scheme described above was the first step in providing a probabilistic justification for the optimality
of the actual practical algorithms. Subsequently, in [76] and [0] Wyner and Ziv established the
asymptotic optimality of the Sliding-Window (SWLZ) and the Fixed-Database (FDLZ) versions of
the algorithm.
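
The longest-match search that replaces the first-match search can be illustrated as follows. This is a simplified sketch and not the SWLZ or FDLZ algorithm itself: the function name is hypothetical, the search is brute force, and matches are not allowed to run past the end of the window.

```python
def longest_match(window, lookahead):
    """Return (offset, length) of the longest prefix of lookahead found in window.

    offset is the 0-based start inside window; (0, 0) means no symbol matched.
    Ties are resolved in favor of the earliest offset.
    """
    best_offset, best_length = 0, 0
    for start in range(len(window)):
        length = 0
        while (length < len(lookahead)
               and start + length < len(window)
               and window[start + length] == lookahead[length]):
            length += 1
        if length > best_length:
            best_offset, best_length = start, length
    return best_offset, best_length

print(longest_match("abracadabra", "abrax"))
```

An encoder built on this search would transmit the pair (offset, length) in place of the matched symbols and then slide the window forward.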



1.2 Lossy Data Compression 

A similar development to the one outlined above can be given in the case of lossy data compression,
this time centered around a lossy analog of the AEP [^ ]. To motivate this discussion we look at
Shannon's original random coding proof of the (direct) lossy source coding theorem [69].

Shannon's Random Code. Suppose we want to describe the output $X_1^n$ of a memoryless source,
with distortion $D$ or less with respect to a family of single-letter distortion measures $\{\rho_n\}$. Let $Q_n^*$
be the optimum reproduction distribution on $\hat{A}^n$, where $\hat{A}$ is the reproduction alphabet. Shannon's
random coding argument says that we should construct a codebook $\mathcal{T}_n$ of $2^{n(R(D)+\epsilon)}$ codewords $Y_1^n$
generated i.i.d. from $Q_n^*$, where $R(D)$ is the rate-distortion function of the source (in bits). The proof
that $2^{n(R(D)+\epsilon)}$ codewords indeed suffice is based on the following result, Lemma 1 in [69].

Shannon's "Lemma 1": For $x_1^n \in A^n$ let $B(x_1^n, D)$ denote the distortion-ball of radius $D$ around
$x_1^n$, i.e., the collection of all reproduction strings $y_1^n \in \hat{A}^n$ with $\rho_n(x_1^n, y_1^n) \leq D$. When $n$ is large:^1

$$Q_n^*(B(X_1^n, D)) \geq 2^{-n(R(D)+\epsilon)} \quad \text{with high probability.} \tag{4}$$

^1 The notation in Shannon's statement is slightly different, and he considers the more general case of ergodic sources.
For the sake of clarity we restrict attention here to the i.i.d. case.

In the proof of the coding theorem this lemma plays the same role that the AEP played in the
lossless case; notice the similarity between (4) and its analog (2) in the lossless case.

Let's fix a source string $X_1^n$ to be encoded. The probability that $X_1^n$ matches any one of the
codewords in $\mathcal{T}_n$ is

$$\Pr\{\rho_n(X_1^n, Y_1^n) \leq D \mid X_1^n\} = \Pr\{Y_1^n \in B(X_1^n, D) \mid X_1^n\} = Q_n^*(B(X_1^n, D))$$

and by the lemma this probability is at least $2^{-n(R(D)+\epsilon)}$. Therefore, with $2^{n(R(D)+\epsilon)}$ independent
codewords to choose from, we have a good chance of finding a match with distortion $D$ or less.
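
For one concrete case the exponent in (4) can be computed exactly. The sketch below assumes a fair-coin binary source with Hamming distortion, for which it is a standard fact that $R(D) = 1 - h(D)$ bits for $0 \leq D \leq 1/2$ and that the optimal reproduction distribution is again fair-coin; the function names are illustrative. In this case the ball probability does not depend on the center, so it reduces to a binomial sum.

```python
import math

def binary_entropy_bits(d):
    """Binary entropy h(d) in bits, with h(0) = h(1) = 0."""
    if d in (0.0, 1.0):
        return 0.0
    return -d * math.log2(d) - (1 - d) * math.log2(1 - d)

def ball_exponent_bits(n, d):
    """Return -(1/n) log2 Q^n(B(x, d)) for fair-coin codewords and Hamming distortion.

    The ball holds every string within floor(n*d) flips of x, so its probability
    is sum_{k <= n*d} C(n, k) / 2^n, whatever the center x is.
    """
    radius = math.floor(n * d + 1e-9)            # small slack against float rounding
    count = sum(math.comb(n, k) for k in range(radius + 1))
    return (n - math.log2(count)) / n

for n in (50, 500, 2000):
    print(n, ball_exponent_bits(n, 0.1), 1 - binary_entropy_bits(0.1))
```

As $n$ grows the exact exponent approaches $R(D)$ from above, consistent with the lower bound (4) holding for every fixed $\epsilon > 0$ once $n$ is large.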

Generalized AEP and Applications. A stronger and more general version of Lemma 1 will be
our starting point in this paper. In the following section we will prove a generalized AEP: For any
product measure $Q^n$ on $\hat{A}^n$,

$$-\frac{1}{n} \log Q^n(B(X_1^n, D)) \to R_1(P, Q, D) \quad \text{w.p.1} \tag{5}$$

where $R_1(P, Q, D)$ is a (non-random) function of the distributions $P$ and $Q$ and of the distortion level
$D$. [We will later prove several variants of (5) under much weaker assumptions.]

Like the AEP in the lossless case, the generalized AEP and its refinements find numerous appli- 
cations in data compression, universal data compression, and in general pattern-matching questions. 
Many of these applications were inspired by the treatment in Wyner and Ziv's 1989 paper [75]. A (very
incomplete) sample of subsequent work in the Wyner-Ziv spirit includes the papers [71||55|[^9|[^ on 
lossy data compression, and [Esl p^] on pattern-matching. 



Aaron Wyner himself remained active in this field for the following ten years, and his last paper 
[78], co-written with J. Ziv and A.J. Wyner, was a review paper on this subject. In the present paper
we review the corresponding developments in the lossy case, and in the process we add new results 
(and some new proofs of recent results) in an attempt to present a more complete picture. 

1.3 Central Themes, Paper Outline 

In Section 2 we give an extensive discussion of the generalized AEP. By now there are numerous 
different proofs under different assumptions, and we offer a streamlined approach to the most general 
versions using techniques from large deviation theory (cf. and Bucklew's earlier work
[13][14]). We also discuss the relationship of the generalized AEP with the classical extensions of the
AEP (due to Barron [| and Orey |l|) to processes with densities. We establish a formal connection
between these two by looking at the limit of the distortion level $D \downarrow 0$.

In Section 3 we develop applications of the generalized AEP to a number of related problems. 
We show how the generalized AEP can be used to determine the asymptotic behavior of Shannon's 
random coding scheme, and we discuss the role of mismatch in lossy data compression. We also 
determine the first order asymptotic behavior of waiting times and longest match-lengths between 
stationary processes. The main ideas used here are strong approximation |^6| and duality ||7^. We 
present strengthened versions of Shannon's direct lossy source coding theorem (and of a corresponding 
universal coding theorem), showing that almost all random codebooks achieve essentially the same 
compression performance. A lossy version of the Lempel-Ziv algorithm is recalled, which achieves 
optimal compression performance (asymptotically) as well as polynomial complexity at the encoder. 
We also discuss how the classical source coding problem can be generalized to a question about 
weighted sphere-covering. The answer to this question gives, as corollaries, Shannon's coding theorems,
Stein's lemma in hypothesis testing, and some converse concentration inequalities.






Section 4 is devoted to second order refinements of the AEP and the generalized AEP. It is shown,
for example, that under certain conditions $-\log P^n(X_1^n)$ and $-\log Q^n(B(X_1^n, D))$ are asymptotically
Gaussian. These refinements are used in Section 5 to provide corresponding second order results
(such as central limit theorems) for the applications considered in Section 3. We prove second order
asymptotic results for waiting times and longest match-lengths. Precise redundancy rates are given
for Shannon's random code, and converse coding theorems show that the random code achieves the
optimal pointwise redundancy, up to terms of order $\log n$. For i.i.d. sources the pointwise redundancy
is typically of order $\sigma\sqrt{n}$, where $\sigma^2$ is the minimal coding variance of the source. When $\sigma = 0$ these
fluctuations disappear, and the best pointwise redundancy is of order $\log n$. The question of exactly
when $\sigma$ can be equal to zero is briefly discussed.

Finally, Sections 6 and 7 contain generalizations of some of the above results to random fields. 
All the results stated there are new, although most of them are straightforward generalizations of 
corresponding one-dimensional results. 

2 The Generalized AEP 
2.1 Notation and Definitions 

We begin by introducing some basic definitions and notation that will remain in effect for the rest
of the paper. We will consider a stationary ergodic process $X = \{X_n ; n \in \mathbb{Z}\}$ taking values in a
general alphabet $A$.^2 When talking about data compression, $X$ will be our source and $A$ will be called
the source alphabet. We write $X_i^j$ for the vector of random variables $X_i^j = (X_i, X_{i+1}, \ldots, X_j)$, and
similarly $x_i^j = (x_i, x_{i+1}, \ldots, x_j) \in A^{j-i+1}$ for a realization of these random variables, $-\infty \leq i \leq j \leq \infty$.
We let $P_n$ denote the marginal distribution of $X_1^n$ on $A^n$ ($n \geq 1$), and write $\mathbb{P}$ for the distribution of the
whole process. Similarly, we take $Y = \{Y_n ; n \in \mathbb{Z}\}$ to be a stationary ergodic process taking values
in the (possibly different) alphabet $\hat{A}$.^2 In the context of data compression, $\hat{A}$ is the reproduction
alphabet and $Y$ has the "codebook" distribution. We write $Q_n$ for the marginal distribution of $Y_1^n$
on $\hat{A}^n$, $n \geq 1$, and $\mathbb{Q}$ for the distribution of the whole process $Y$. We will always assume that the
process $Y$ is independent of $X$.

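Let $\rho : A \times \hat{A} \to [0, \infty)$ be an arbitrary nonnegative (measurable) function, and define a sequence
of single-letter distortion measures $\rho_n : A^n \times \hat{A}^n \to [0, \infty)$ by

$$\rho_n(x_1^n, y_1^n) \triangleq \frac{1}{n} \sum_{i=1}^n \rho(x_i, y_i), \quad x_1^n \in A^n, \ y_1^n \in \hat{A}^n.$$

Given $D \geq 0$ and $x_1^n \in A^n$, we write $B(x_1^n, D)$ for the distortion-ball of radius $D$ around $x_1^n$:

$$B(x_1^n, D) = \{y_1^n \in \hat{A}^n : \rho_n(x_1^n, y_1^n) \leq D\}.$$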
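
The two definitions above translate directly into code. This is only a small sketch: the function names and the two example distortion functions (Hamming and squared error) are illustrative choices.

```python
def rho_n(x, y, rho):
    """Average single-letter distortion (1/n) * sum_i rho(x_i, y_i)."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    return sum(rho(a, b) for a, b in zip(x, y)) / len(x)

def in_ball(x, y, d, rho):
    """True when the reproduction string y lies in the distortion ball B(x, d)."""
    return rho_n(x, y, rho) <= d

def hamming(a, b):
    """Hamming distortion: 0 for equal letters, 1 otherwise."""
    return 0.0 if a == b else 1.0

def squared_error(a, b):
    """Squared-error distortion for real-valued letters."""
    return (a - b) ** 2

x = [0, 1, 1, 0]
y = [0, 1, 0, 0]
print(rho_n(x, y, hamming), in_ball(x, y, 0.25, hamming))
```

Here `y` differs from `x` in one of four positions, so it belongs to the ball of radius $D$ exactly when $D \geq 1/4$.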

Throughout the paper, $\log$ denotes the natural logarithm and $\log_2$ the logarithm to base 2. Unless
otherwise mentioned, all familiar information-theoretic quantities (such as the entropy, mutual
information, and so on) are assumed to be defined in terms of natural logarithms (and are therefore given
in nats).

^2 To avoid uninteresting technicalities, we will assume throughout that $A$ is a complete, separable metric space,
equipped with its associated Borel $\sigma$-field $\mathcal{A}$. Similarly we take $(\hat{A}, \hat{\mathcal{A}})$ to be the Borel measurable space corresponding
to a complete, separable metric space $\hat{A}$.

2.2 Generalized AEP When Y is I.I.D.

In the case when $A$ is finite, the classical AEP, also known as the Shannon-McMillan-Breiman theorem
(see [20, Chapter 15] or the original papers [p8| [E^ [|lO[ [p| ), states that as $n \to \infty$

$$-\frac{1}{n} \log P_n(X_1^n) \to H(\mathbb{P}) \quad \text{w.p.1} \tag{6}$$

where

$$H(\mathbb{P}) = \lim_{n \to \infty} \frac{1}{n} E[-\log P_n(X_1^n)]$$

is the entropy rate of the process $X$ (in nats, since we are taking logarithms to base $e$). As we saw in
the Introduction, in lossy data compression the role of the AEP is taken up by the result of Shannon's
"Lemma 1" and, more generally, by statements of the form

$$-\frac{1}{n} \log Q_n(B(X_1^n, D)) \to R(\mathbb{P}, \mathbb{Q}, D) \quad \text{w.p.1}$$

for some non-random "rate-function" $R(\mathbb{P}, \mathbb{Q}, D)$.

First we consider the simplest case where $Y$ is assumed to be an i.i.d. process. We write $Q = Q_1$
for its first order marginal, so that $Q_n = Q^n$, for $n \geq 1$. Similarly we write $P = P_1$ for the first order
marginal of $X$. Let

$$D_{\min} = E_P\left[\operatorname*{ess\,inf}_{Y \sim Q} \rho(X, Y)\right] \tag{7}$$

$$D_{\rm av} = E_{P \times Q}[\rho(X, Y)]. \tag{8}$$

[Recall that the essential infimum of a function $g(Y)$ of the random variable $Y$ with distribution $Q$ is
defined as $\operatorname*{ess\,inf}_{Y \sim Q} g(Y) = \sup\{t \in \mathbb{R} : Q\{g(Y) \geq t\} = 1\}$.]

Clearly $0 \leq D_{\min} \leq D_{\rm av}$. To avoid the trivial case when $\rho(x, y)$ is essentially constant for ($P$-almost)
all $x \in A$, we assume that with positive $P$-probability $\rho(x, y)$ is not essentially constant in $y$,
that is:

$$D_{\min} < D_{\rm av}. \tag{9}$$

Note also that for $D$ greater than $D_{\rm av}$, the probability $Q^n(B(X_1^n, D)) \to 1$ as $n \to \infty$ (this is easy to
see by the ergodic theorem), so we restrict our attention to distortion levels $D < D_{\rm av}$.
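
For finite alphabets the quantities in (7) and (8) are finite sums. The sketch below assumes that $P$ and $Q$ are given as dictionaries mapping letters to probabilities; the function names are illustrative. With a finite reproduction alphabet the essential infimum is simply the minimum over the letters that have positive $Q$-probability.

```python
def d_min(p, q, rho):
    """D_min = E_P[ ess inf_{Y ~ Q} rho(X, Y) ] for finite alphabets."""
    support = [y for y, qy in q.items() if qy > 0]
    return sum(px * min(rho(x, y) for y in support) for x, px in p.items())

def d_av(p, q, rho):
    """D_av = E_{P x Q}[ rho(X, Y) ] for finite alphabets."""
    return sum(px * qy * rho(x, y) for x, px in p.items() for y, qy in q.items())

def hamming(a, b):
    """Hamming distortion: 0 for equal letters, 1 otherwise."""
    return 0.0 if a == b else 1.0

p = {0: 0.3, 1: 0.7}
q_fair = {0: 0.5, 1: 0.5}       # non-degenerate codebook distribution
q_point = {0: 1.0, 1: 0.0}      # degenerate codebook distribution
print(d_min(p, q_fair, hamming), d_av(p, q_fair, hamming))
print(d_min(p, q_point, hamming), d_av(p, q_point, hamming))
```

The second pair illustrates the trivial case excluded by (9): when $Q$ is a point mass, $\rho(x, Y)$ is constant for every $x$ and the two quantities coincide.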

Theorem 1. Generalized AEP when Y is i.i.d.: Let $X$ be a stationary ergodic process and $Y$ be
i.i.d. with marginal distribution $Q$ on $\hat{A}$. Assume that $D_{\rm av} = E_{P \times Q}[\rho(X, Y)]$ is finite. Then for any
$D \in (D_{\min}, D_{\rm av})$

$$-\frac{1}{n} \log Q^n(B(X_1^n, D)) \to R_1(P, Q, D) \quad \text{w.p.1.}$$

The rate-function $R_1(P, Q, D)$ is defined as

$$R_1(P, Q, D) = \inf_W H(W \| P \times Q)$$

where $H(W \| V)$ denotes the relative entropy between two distributions $W$ and $V$,

$$H(W \| V) \triangleq \begin{cases} E_W\left[\log \frac{dW}{dV}\right] & \text{if the density } \frac{dW}{dV} \text{ exists} \\ \infty & \text{otherwise} \end{cases}$$

and the infimum is taken over all joint distributions $W$ on $A \times \hat{A}$ such that the first marginal of $W$ is
$P$ and $E_W[\rho(X, Y)] \leq D$.

Example 1: The rate-function $R_1(P, Q, D)$ when $Q$ is Gaussian: Although in general the rate-function
$R_1(P, Q, D)$ cannot be evaluated explicitly, here we show that it is possible to obtain an
exact expression for $R_1(P, Q, D)$ in the special case when $\rho(x, y) = (x - y)^2$, $X$ is a real-valued
process, and $Q$ is a Gaussian measure on $\mathbb{R}$. Specifically, assume that $X$ is a zero-mean, stationary
ergodic process with finite variance $\sigma^2 = \mathrm{Var}(X_1) < \infty$, and take $Q$ to be a zero-mean Gaussian
measure with variance $\tau^2$, i.e., $Q \sim N(0, \tau^2)$. Under these assumptions, it is easy to see that $D_{\min} = 0$
and $D_{\rm av} = \sigma^2 + \tau^2$. Moreover, with the help of Proposition 2 below, $R_1(P, Q, D)$ can be explicitly
evaluated as:

$$R_1(P, Q, D) = \begin{cases} \infty, & D = 0 \\ \frac{1}{2} \log\left(\frac{v}{D}\right) - \frac{(v - D)(v - \sigma^2)}{2 v \tau^2}, & 0 < D < \sigma^2 + \tau^2 \end{cases}$$

where

$$v \triangleq \frac{1}{2}\left[\tau^2 + \sqrt{\tau^4 + 4 D \sigma^2}\right].$$

We will come back to this example when considering mismatched rate-distortion codebooks in
Section 3.2.
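
The Gaussian case can be checked numerically. The sketch below is an independent calculation under the assumptions of Example 1, not code from the paper: it uses the function $\Lambda(\lambda) = E_P[\log E_Q e^{\lambda \rho(X, Y)}]$ from the proof of Theorem 1, which for squared error and $Q \sim N(0, \tau^2)$ works out to $-\frac{1}{2}\log u + \lambda\sigma^2/u$ with $u = 1 - 2\lambda\tau^2$, and compares a numerical Legendre transform with the closed form obtained by solving $\Lambda'(\lambda) = D$ by hand. All function names are hypothetical.

```python
import math

def r1_gaussian_closed_form(d, sigma2, tau2):
    """Closed form for R_1(P, Q, D) in nats (squared error, Q = N(0, tau2), Var(X) = sigma2)."""
    if d <= 0:
        return math.inf
    if d >= sigma2 + tau2:
        return 0.0                    # the ball probability tends to 1 above D_av
    v = 0.5 * (tau2 + math.sqrt(tau2 ** 2 + 4 * d * sigma2))
    return 0.5 * math.log(v / d) - (v - d) * (v - sigma2) / (2 * v * tau2)

def r1_gaussian_legendre(d, sigma2, tau2, iterations=200):
    """Numerical sup over lam <= 0 of [lam * d - Lambda(lam)], for 0 < d < sigma2 + tau2."""
    def lam_fn(lam):
        u = 1 - 2 * lam * tau2
        return -0.5 * math.log(u) + lam * sigma2 / u

    def lam_prime(lam):
        u = 1 - 2 * lam * tau2
        return tau2 / u + sigma2 / u ** 2

    lo, hi = -1.0, 0.0
    while lam_prime(lo) > d:          # Lambda' increases from 0 to D_av on (-inf, 0]
        lo *= 2.0
    for _ in range(iterations):       # bisection for the root of Lambda'(lam) = d
        mid = 0.5 * (lo + hi)
        if lam_prime(mid) < d:
            lo = mid
        else:
            hi = mid
    lam_star = 0.5 * (lo + hi)
    return lam_star * d - lam_fn(lam_star)

print(r1_gaussian_closed_form(0.5, 1.0, 1.0), r1_gaussian_legendre(0.5, 1.0, 1.0))
```

Agreement of the two values at several distortion levels is a useful sanity check on the closed-form expression, since only the second moment $\sigma^2$ of the source enters the calculation.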

Remark 1: In more familiar information-theoretic terms, the rate-function $R_1(P, Q, D)$ can
equivalently be defined as (cf. [79])

$$R_1(P, Q, D) = \inf\,[I(X; Y) + H(Q_Y \| Q)]$$

where $I(X; Y)$ denotes the mutual information (in nats) between the random variables $X$ and $Y$, and
the infimum is over all jointly distributed random variables $(X, Y)$ with values in $A \times \hat{A}$ such that $X$
has distribution $P$, $E[\rho(X, Y)] \leq D$, and $Q_Y$ denotes the distribution of $Y$.

Remark 2: The assumption that $Y$ is i.i.d. is clearly restrictive and it will be relaxed below. On
the other hand the assumptions on the distortion measure $\rho$ seem to be minimal; we simply assume
that $\rho$ has finite expectation (in the more general results below $\rho$ is assumed to be bounded). In this
form, the result of Theorem 1 is new.

Discussion of Proof: Let's fix a realization $x_1^\infty$ of $X$. The probability $Q^n(B(x_1^n, D))$ can be written
as

$$\Pr\{Y_1^n \in B(X_1^n, D) \mid X_1^n = x_1^n\} = \Pr\left\{\frac{1}{n} \sum_{i=1}^n \rho(x_i, Y_i) \leq D\right\}.$$

Since the distortion level $D$ is taken smaller than the average value $D_{\rm av}$, this is a large deviations
probability for the partial sums $(1/n) \sum_{i=1}^n Z_i$ of the independent (but not identically distributed)
random variables $Z_i = \rho(x_i, Y_i)$. The proof is essentially an application of the Gärtner-Ellis theorem
of large deviations to the random variables $\{Z_i\}$.

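Proof Outline: Choose and fix a realization $x_1^\infty$ of $X$ and define the random variables $Z_i = \rho(x_i, Y_i)$.
Let

$$S_n = \frac{1}{n} \sum_{i=1}^n Z_i$$

and define the log-moment generating functions of the normalized partial sums $S_n$ by

$$\Lambda_n(\lambda) = \log E_{Q^n}\left(e^{\lambda S_n}\right), \quad \lambda \leq 0.$$

Then for any $\lambda \leq 0$, by the ergodic theorem we have that

$$\frac{1}{n} \Lambda_n(n\lambda) = \frac{1}{n} \sum_{i=1}^n \log E_Q\left(e^{\lambda \rho(x_i, Y)}\right) \to \Lambda(\lambda) \triangleq E_P\left[\log E_Q\left(e^{\lambda \rho(X, Y)}\right)\right] \tag{10}$$

for $\mathbb{P}$-almost any realization $x_1^\infty$. Now we would like to apply the Gärtner-Ellis theorem, but first we
need to check some simple properties of the function $\Lambda(\lambda)$. Note that $\Lambda(\lambda) \leq 0$ and also (by Jensen's
inequality) $\Lambda(\lambda) \geq \lambda D_{\rm av} > -\infty$, for all $\lambda \leq 0$. Moreover, $\Lambda(\lambda)$ is twice differentiable in $\lambda$ with

$$\Lambda'(\lambda) = E_{P \times Q}\left[\frac{\rho(X, Y)\, e^{\lambda \rho(X, Y)}}{E_Q[e^{\lambda \rho(X, Y')}]}\right]$$

and

$$\Lambda''(\lambda) = E_P\left[E_Q\left\{\frac{\rho^2(X, Y)\, e^{\lambda \rho(X, Y)}}{E_Q[e^{\lambda \rho(X, Y')}]}\right\} - \left(E_Q\left\{\frac{\rho(X, Y)\, e^{\lambda \rho(X, Y)}}{E_Q[e^{\lambda \rho(X, Y')}]}\right\}\right)^2\right]$$

(this differentiability is easily verified by an application of the dominated convergence theorem). By
the Cauchy-Schwarz inequality $\Lambda''(\lambda) \geq 0$ for all $\lambda \leq 0$, and in fact $\Lambda''(\lambda)$ is strictly positive due to
assumption (9). Also it is not hard to verify that

$$\lim_{\lambda \uparrow 0} \Lambda'(\lambda) = D_{\rm av} \quad \text{and} \quad \lim_{\lambda \downarrow -\infty} \Lambda'(\lambda) = D_{\min}. \tag{11}$$

Since $D \in (D_{\min}, D_{\rm av})$, there exists a unique $\lambda^* < 0$ with $\Lambda'(\lambda^*) = D$, and therefore the Fenchel-Legendre
transform of $\Lambda(\lambda)$ evaluated at $D$ is

$$\Lambda^*(D) \triangleq \sup_{\lambda \leq 0}\,[\lambda D - \Lambda(\lambda)] = \lambda^* D - \Lambda(\lambda^*).$$

Now we can apply the Gärtner-Ellis theorem [25, Theorem 2.3.6] to deduce from (10) that with
$\mathbb{P}$-probability one

$$-\frac{1}{n} \log Q^n(B(X_1^n, D)) \to \Lambda^*(D).$$

The proof is complete upon noticing that $\Lambda^*(D)$ is nothing but $R_1(P, Q, D)$. This is stated and proved
in the following proposition. □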

Proposition 2. Characterization of the Rate Function: In the notation of the proof of Theorem 1,
$\Lambda^*(D) = R_1(P, Q, D)$, for $D \in (D_{\min}, D_{\rm av})$.

Proof Outline: Under additional assumptions on the distortion measure $\rho$ this has appeared in
various papers (see, e.g., [23][81]). For completeness, we offer a proof sketch here.
In the notation of the above proof, consider the measure $W$ on $A \times \hat{A}$ defined by

$$\frac{dW(x, y)}{d(P \times Q)} = \frac{e^{\lambda^* \rho(x, y)}}{E_Q[e^{\lambda^* \rho(x, Y)}]}.$$

Obviously the first marginal of $W$ is $P$ and it is easy to check that $E_W[\rho(X, Y)] = \Lambda'(\lambda^*) = D$.
Therefore, by the definitions of $R_1(P, Q, D)$ and $W$, and by the choice of $\lambda^*$:

$$R_1(P, Q, D) \leq H(W \| P \times Q) = \lambda^* D - \Lambda(\lambda^*) = \Lambda^*(D). \tag{12}$$

To prove the corresponding lower bound we first claim that for any measurable function $\phi : \hat{A} \to
(-\infty, 0]$, and any probability measure $Q'$ on $\hat{A}$,

$$H(Q' \| Q) \geq E_{Q'}(\phi(Y)) - \log E_Q(e^{\phi(Y)}). \tag{13}$$

Let $Q_\phi$ denote the probability measure on $\hat{A}$ such that $dQ_\phi/dQ = e^{\phi}/E_Q(e^{\phi(Y)})$. Clearly, it suffices to
prove (13) in case $dQ'/dQ$ exists, in which case the difference between the left and right hand sides is

$$E_{Q'}\left[\log \frac{dQ'}{dQ} - \log \frac{dQ_\phi}{dQ}\right] = H(Q' \| Q_\phi) \geq 0.$$

Given an arbitrary candidate $W$ as in the definition of $R_1(P, Q, D)$ and any $x \in A$, we take $Q' = W(\cdot|x)$
and $\phi(y) = \lambda^* \rho(x, y)$ in (13) to get that

$$H(W(\cdot|x) \| Q(\cdot)) \geq \lambda^* E_{W(\cdot|x)}[\rho(x, Y)] - \log E_Q(e^{\lambda^* \rho(x, Y)}).$$

Substituting $X$ for $x$, taking expectations of both sides with respect to $P$, and recalling that $\lambda^* < 0$
and $E_W[\rho(X, Y)] \leq D$, we get:

$$H(W \| P \times Q) \geq \lambda^* D - \Lambda(\lambda^*) = \Lambda^*(D).$$

Since $W$ was arbitrary it follows that $R_1(P, Q, D) \geq \Lambda^*(D)$, and together with (12) this completes the
proof. □
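
For finite alphabets, the characterization $R_1(P, Q, D) = \lambda^* D - \Lambda(\lambda^*)$ gives a simple way to evaluate the rate-function. The following is a minimal sketch under stated assumptions: $P$ and $Q$ are dictionaries of probabilities, the root $\lambda^*$ of $\Lambda'(\lambda) = D$ is found by bisection (one convenient choice, justified by the strict monotonicity of $\Lambda'$), and all function names are hypothetical.

```python
import math

def log_mgf(lam, p, q, rho):
    """Lambda(lam) = E_P[ log E_Q exp(lam * rho(X, Y)) ], in nats."""
    return sum(px * math.log(sum(qy * math.exp(lam * rho(x, y)) for y, qy in q.items()))
               for x, px in p.items())

def log_mgf_derivative(lam, p, q, rho):
    """Lambda'(lam), the P-average of the tilted mean of rho(x, Y)."""
    total = 0.0
    for x, px in p.items():
        num = sum(qy * rho(x, y) * math.exp(lam * rho(x, y)) for y, qy in q.items())
        den = sum(qy * math.exp(lam * rho(x, y)) for y, qy in q.items())
        total += px * num / den
    return total

def rate_function(d, p, q, rho, iterations=200):
    """R_1(P, Q, D) = lam* D - Lambda(lam*) in nats, for D_min < D < D_av."""
    if log_mgf_derivative(0.0, p, q, rho) <= d:
        return 0.0                                # D >= D_av: nothing to pay
    lo, hi = -1.0, 0.0
    while log_mgf_derivative(lo, p, q, rho) > d:  # bracket the root lam* < 0
        lo *= 2.0
        if lo < -1e6:
            raise ValueError("D appears to be at or below D_min")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if log_mgf_derivative(mid, p, q, rho) < d:
            lo = mid
        else:
            hi = mid
    lam_star = 0.5 * (lo + hi)
    return lam_star * d - log_mgf(lam_star, p, q, rho)

def hamming(a, b):
    """Hamming distortion: 0 for equal letters, 1 otherwise."""
    return 0.0 if a == b else 1.0

p = {0: 0.3, 1: 0.7}
q = {0: 0.5, 1: 0.5}
print(rate_function(0.1, p, q, hamming))
```

With fair-coin $Q$ and Hamming distortion, $\Lambda(\lambda) = \log((1 + e^{\lambda})/2)$ regardless of $P$, so the value can be checked by hand against $\log 2 + D \log D + (1 - D)\log(1 - D)$.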

2.3 Generalized AEP When Y is Not I.I.D. 

Next we present two versions of the generalized AEP that hold when $Y$ is a stationary dependent
process, under some additional conditions.

Throughout this section we will assume that the distortion measure is essentially bounded, i.e.,

$$D_{\max} \triangleq \operatorname*{ess\,sup}_{(X_1, Y_1) \sim P_1 \times Q_1} \rho(X_1, Y_1) < \infty. \tag{14}$$

We let $D_{\rm av}$ be defined as earlier, $D_{\rm av} = E_{P_1 \times Q_1}[\rho(X_1, Y_1)]$, and for $n \geq 1$ we let

$$D_{\min}^{(n)} \triangleq E_{P_n}\left[\operatorname*{ess\,inf}_{Y_1^n \sim Q_n} \rho_n(X_1^n, Y_1^n)\right].$$

It is easy to see that $n D_{\min}^{(n)}$ is a finite, superadditive sequence, and therefore we can also define

$$D_{\min} \triangleq \lim_{n \to \infty} D_{\min}^{(n)} = \sup_{n \geq 1} D_{\min}^{(n)}.$$

As before, we will assume that the distortion measure $\rho$ is not essentially constant, that is, $D_{\min} < D_{\rm av}$.

We first state a version of the generalized AEP that was recently proved by Chi for processes
$Y$ satisfying a rather strong mixing condition: We say that the stationary process $Y$ is $\psi_\pm$-mixing, if
for all $d$ large enough there is a finite constant $C_d$ such that

$$C_d^{-1}\, \mathbb{Q}(A)\, \mathbb{Q}(B) \leq \mathbb{Q}(A \cap B) \leq C_d\, \mathbb{Q}(A)\, \mathbb{Q}(B)$$

for all events $A \in \sigma(Y_{-\infty}^0)$ and $B \in \sigma(Y_d^\infty)$, where $\sigma(Y_i^j)$ denotes the $\sigma$-field generated by $Y_i^j$. Recall
the usual definition according to which $Y$ is called $\psi$-mixing if in fact the constants $C_d \to 1$ as $d \to \infty$;
see iP for more details. Clearly $\psi_\pm$-mixing is weaker than $\psi$-mixing.

Theorem 3. Generalized AEP when Y is $\psi_\pm$-mixing j7^/: Let $X$ and $Y$ be stationary ergodic
processes. Assume that $Y$ is $\psi_\pm$-mixing, and that the distortion measure $\rho$ is bounded. Then for all
$D \in (D_{\min}, D_{\rm av})$

$$-\frac{1}{n} \log Q_n(B(X_1^n, D)) \to R(\mathbb{P}, \mathbb{Q}, D) \quad \text{w.p.1} \tag{15}$$

where $R(\mathbb{P}, \mathbb{Q}, D)$ is the rate-function defined by

$$R(\mathbb{P}, \mathbb{Q}, D) = \lim_{n \to \infty} R_n(P_n, Q_n, D) \tag{16}$$

where, for $n \geq 1$,

$$R_n(P_n, Q_n, D) \triangleq \inf_{V_n} \frac{1}{n} H(V_n \| P_n \times Q_n)$$

and the infimum is taken over all joint distributions $V_n$ on $A^n \times \hat{A}^n$ such that the $A^n$-marginal of $V_n$
is $P_n$ and $E_{V_n}[\rho_n(X_1^n, Y_1^n)] \leq D$.

As we discussed in the previous section, the proof of most versions of the generalized AEP
consists of two steps: First a "conditional large deviations" result is proved for the random variables
$\{\rho_n(x_1^n, Y_1^n) ; n \geq 1\}$, where $x_1^\infty$ is a fixed realization of the process $X$. Second, the rate-function
$R(\mathbb{P}, \mathbb{Q}, D)$ is characterized as the limit of a sequence of minimizations in terms of relative entropy.

In a subsequent paper, Chi [^] showed that the first of these steps (the large deviations part)
remains valid under a condition weaker than $\psi_\pm$-mixing, condition (S) of |T^. In the following
theorem we give a general version of the second step; we prove that the generalized AEP (15) and the
formula (16) for the rate-function remain valid as long as the random variables $\{\rho_n(x_1^n, Y_1^n) ; n \geq 1\}$
satisfy a large deviations principle (LDP) with some deterministic, convex rate-function (see [^5| for
the precise meaning of this statement).

Theorem 4. Let $X$ and $Y$ be stationary processes. Assume that $\rho$ is bounded, and that with
$\mathbb{P}$-probability one, conditional on $X_1^\infty = x_1^\infty$, the random variables $\{\rho_n(x_1^n, Y_1^n) ; n \geq 1\}$ satisfy a
large deviations principle with some deterministic, convex rate-function. Then, both (15) and (16)
hold for any $D \in (D_{\min}, D_{\rm av})$, except possibly at the point $D = D_{\min}^{(\infty)}$, where

$$D_{\min}^{(\infty)} \triangleq \inf\left\{D \geq 0 : \sup_{n \geq 1} R_n(P_n, Q_n, D) < \infty\right\}. \tag{17}$$

Since Theorem 4 has an exact analog in the case of random fields, we postpone its proof until the
proof of the corresponding result (Theorem 27) in Section 6.

Remark 3: Suppose that the joint process $(X, Y)$ is stationary, and that it satisfies a "process-level
large deviations principle" (see Remark 6 in Section 6 for a somewhat more detailed statement)
on the space of stationary probability measures on $(A^\infty \times \hat{A}^\infty)$ equipped with the topology of weak
convergence. Assume, moreover, that this LDP holds with a convex, good rate-function $I(\cdot)$. [See
[22|[26, Sec. 5.3, 5.4] Sec. 6.5.3] |12] for a general discussion as well as specific examples of processes
for which the above conditions hold. Apart from the i.i.d. case, these examples also include all ergodic
finite-state Markov chains, among many others.]

It is easy to check that, when $\rho$ is bounded and continuous on $A \times \hat{A}$, then with $\mathbb{P}$-probability
one, conditional on $X_1^\infty = x_1^\infty$, the random variables $\{\rho_n(x_1^n, Y_1^n)\}$ satisfy the LDP upper bound with respect
to the deterministic, convex rate-function $J(D) = \inf I(\nu)$, where the infimum is over all stationary
probability measures $\nu$ on $A^\infty \times \hat{A}^\infty$ such that the $A^\infty$-marginal of $\nu$ is $\mathbb{P}$ and $E_\nu[\rho(X_1, Y_1)] = D$.
Indeed, Comets [^] provides such an argument when $X$ and $Y$ are both i.i.d. Moreover, he shows
that in that case the corresponding LDP lower bound also holds, and hence Theorem 4 applies.
Unfortunately, the conditional LDP lower bound has to be verified on a case-by-case basis.

Remark 4: Although quite strong, the $\psi_\pm$-mixing condition of Theorem 3, and the (S)-mixing
condition of |jl^, probably cannot be significantly relaxed: For example, in the special case when $X$ is
a constant process taking on just a single value, if Theorem 3 were to hold (for any bounded distortion
measure) with a strictly monotone rate-function, then necessarily the empirical measures of $Y_1^n$ would
satisfy the LDP in the space Va{A) (see for details). But [l^, Example 1] illustrates that this LDP
may fail even when $Y$ is a stationary ergodic Markov chain with discrete alphabet $\hat{A}$. In particular,
the example in [12] has an exponential 0-mixing rate.



2.4 Generalized AEP for Optimal Lossy Compression 

Here we present a version of the generalized AEP that is useful in proving direct coding theorems.
Let $X$ be a stationary ergodic process. For the distortion measure $\rho$ we adopt two simple regularity
conditions. We assume the existence of a reference letter, i.e., an $\hat{a} \in \hat{A}$ such that

$$E_{P_1}[\rho(X_1, \hat{a})] < \infty.$$

Also, following [42], we require that for any distortion level $D > 0$ there is a scalar quantizer for $X$
with finite rate.

Quantization Condition: For each $D > 0$ there is a "quantizer" $q : A \to B$ for some countable
(finite or infinite) subset $B \subset \hat{A}$, such that:

i. $\rho(x, q(x)) \leq D$ for all $x \in A$, and

ii. the entropy $H(q(X_1)) < \infty$.

The following was implicitly proved in ||4^; see also ||5^ for details.

Theorem 5. Generalized AEP for Optimal Lossy Compression ^^/: Let $X$ be a stationary ergodic
process. Assume that the distortion measure $\rho$ satisfies the quantization condition, that a reference
letter exists, and that for each $n \geq 1$ the infimum of

$$E_{P_n}[-\log Q_n(B(X_1^n, D))]$$

over all probability measures $Q_n$ on $\hat{A}^n$ is achieved by some $Q_n^*$. Then for any $D > 0$

$$-\frac{1}{n} \log Q_n^*(B(X_1^n, D)) \to R(D) \quad \text{w.p.1} \tag{18}$$

where $R(D)$ is the rate-distortion function of the process $X$.

Historical Remarks: The relevance of the quantities $-\log Q_n(B(X_1^n, D))$ to information theory was
first suggested implicitly by Kieffer [43] and more explicitly by Luczak and Szpankowski [^]. Since
then, many papers have appeared proving the generalized AEP under different conditions; we mention
here a subset of those proving some of the more general results. The case of finite alphabet processes
was considered by Yang and Kieffer |^9[. The generalized AEP for processes with general alphabets
and $Y$ i.i.d. was proved by Dembo and Kontoyiannis [^] and by Yang and Zhang [^]. Finally, the
case when $Y$ is not i.i.d. (Theorem 3) was treated by Chi [16||]r. The observations of Theorem 4
about the rate-function $R(\mathbb{P}, \mathbb{Q}, D)$ are new. Theorem 5 essentially comes from Kieffer's work [43]; see
also g.

We should also mention that, in a somewhat different context, the intimate relationship between
the AEP and large deviations is discussed in some detail by Orey in [32].



2.5 Densities vs. Balls 



Let us recall the classical generalization of the AEP, due to Barron and Orey [61], to processes
with values in general alphabets. Suppose $X$ as above is a general stationary ergodic process with
marginals $\{P_n\}$ that are absolutely continuous with respect to the sequence of measures $\mathbb{M} = \{M_n\}$.

Theorem 6. AEP for Processes with Densities [61]: Let $X$ be a stationary ergodic process whose
marginals $P_n$ have densities $f_n = dP_n/dM_n$ with respect to the $\sigma$-finite measures $M_n$, $n \geq 1$. Assume
that the sequence $\mathbb{M}$ of dominating measures is Markov of finite order, with a stationary transition
measure, and that the relative entropies



Hn — Ep^ 



log 



n— 1\ 



have Hn > — oo eventually. Then 



1^^ dPn 

n dMr, 



-Hi 



n > 2, 



w.p.l 



(19) 



where i7(P||M) is the relative entropy rate defined as H{ 



lim„ Hn = inf„ Hn- 



The AEP for processes with densities is also known to hold when the reference measures $M_n$ do
not form a Markov sequence, under some additional mixing conditions (see [61] where $M_n$ are taken
to be non-Markov measures satisfying an additional mixing condition, and the more recent extension
in |15| where the M„ are taken to be discrete Gibbs measures.) Moreover, Kieffer pl|p2[ | has given 
counterexamples illustrating that without some mixing conditions on $\{M_n\}$ the AEP (19) fails to hold.

There is a tempting analogy between the generalized AEP ( [I5D and the AEP for processes with
densities (19). The formal similarity between the two suggests that, if we identify the measures $Q_n$
with the reference measures $M_n$, corresponding results should hold in the two cases. Indeed, this
does in general appear to be the case, as is illustrated by the various generalized AEPs stated above. 
Moreover, we can interpret the result of Theorem 5 as the natural analog of the classical discrete AEP 
(^) to the case of lossy data compression. As we argued in the introduction, the generalized AEPs of 
the previous sections play analogous roles in the proofs of the corresponding direct coding theorems. 

Taking this analogy further indicates that there might be a relationship between these two different 
generalizations. In particular, when n is large and the distortion level D is small, the following heuristic 
calculation seems compelling: 
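
$$
\begin{aligned}
-H(\mathbb{P}\|\mathbb{Q}) &\overset{(a)}{\approx} -\frac{1}{n}\log\frac{dP_n}{dQ_n}(X_1^n) \\
&\overset{(b)}{\approx} -\frac{1}{n}\log\frac{P_n(B(X_1^n,D))}{Q_n(B(X_1^n,D))} \\
&= -\frac{1}{n}\log P_n(B(X_1^n,D)) + \frac{1}{n}\log Q_n(B(X_1^n,D)) \\
&\overset{(c)}{\approx} R(\mathbb{P},\mathbb{P},D) - R(\mathbb{P},\mathbb{Q},D) \\
&\overset{(d)}{\approx} -H(\mathbb{P}\|\mathbb{Q})
\end{aligned}
$$

where (a) holds in the limit as $n \to \infty$ by Theorem 6, (b) should hold when D is small by the assumption that $P_n$ has a density with respect to $Q_n$, (c) would follow in the limit as $n \to \infty$ by an application of the generalized AEP, and it is natural to conjecture that (d) holds in the limit as $D \downarrow 0$, by reading the above calculation backwards.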

In the following two sections we formalize the above heuristic argument in two special cases: First
when X is a discrete process taking values in a finite alphabet, and second when X is a continuous
process taking values in $\mathbb{R}^d$.

2.5.1 Discrete Case 

Here we take X to be a stationary ergodic process taking values in a finite alphabet A, and Y to be i.i.d. with first order marginal distribution $Q = Q_1$ on the same alphabet $\hat{A} = A$. Similarly we write $P = P_1$ for the first order marginal of X. In Theorem 7 we justify the above calculation by showing that the limits as $D \downarrow 0$ and as $n \to \infty$ can indeed be taken together in any fashion: We show that the double limit of the central expression

$$\frac{1}{n}\log\frac{P_n(B(X_1^n,D))}{Q_n(B(X_1^n,D))}$$

is equal to $H(\mathbb{P}\|\mathbb{Q})$ with probability 1, independently of how n grows and D decreases to zero. Its proof is given in Appendix A.

Theorem 7. Densities vs. Balls in the Discrete Case: Let X be a stationary ergodic process and Y be i.i.d., both on the finite alphabet A. Assume that $\rho(x, y) = 0$ if and only if $x = y$, and $Q(x) > 0$ for all x. Then the following double limit exists:

$$\lim_{n \to \infty,\ D \downarrow 0} \frac{1}{n}\log\frac{P_n(B(X_1^n,D))}{Q_n(B(X_1^n,D))} = H(\mathbb{P}\|\mathbb{Q}) \quad \text{w.p.1.}$$

In particular, the repeated limit $\lim_n \lim_D$ exists with probability one and is equal to $H(\mathbb{P}\|\mathbb{Q})$.

2.5.2 Continuous Case 

Here we state a weaker version of Theorem 7 in the case when $A = \hat{A} = \mathbb{R}^d$ for some $d \ge 1$, and when X is an $\mathbb{R}^d$-valued, stationary ergodic process. Suppose that the marginals $\{P_n\}$ of X are absolutely continuous with respect to a sequence of reference measures $\{Q_n\}$. Throughout this section we take the $Q_n$ to be product measures, $Q_n = Q^n$, for some fixed Borel probability measure Q on $\mathbb{R}^d$. A typical example to keep in mind is when Q is a Gaussian measure on $\mathbb{R}$ and X is a real-valued stationary ergodic process all of whose marginals $P_n$ have continuous densities with respect to Lebesgue measure.

For simplicity, we take $\rho$ to be squared-error distortion, $\rho(x,y) = (x - y)^2$, although the proof of Theorem 8, given in Appendix B, may easily be adapted to apply for somewhat more general difference distortion measures.

Theorem 8. Densities vs. Balls in the Continuous Case: Let X be an $\mathbb{R}^d$-valued stationary ergodic process, whose marginals $P_n$ have densities $f_n = dP_n/dQ_n$ with respect to a sequence of product measures $Q_n = Q^n$, $n \ge 1$, for a given probability measure Q on $\mathbb{R}^d$. Let $\rho(x,y) = (x - y)^2$ for any $x, y \in \mathbb{R}^d$.

(a) The following repeated limit holds:

$$\lim_{n \to \infty} \lim_{D \downarrow 0} \frac{1}{n}\log\frac{P_n(B(X_1^n,D))}{Q_n(B(X_1^n,D))} = H(\mathbb{P}\|\mathbb{Q}) \quad \text{w.p.1.}$$

(b) Assume, moreover, that X is i.i.d. with marginal distribution $P_1 = P$ on $\mathbb{R}^d$, and that the following conditions are satisfied: Both $E_{P \times Q}[\rho(X,Y)]$ and $E_{P \times P}[\rho(X,Y)]$ are finite and nonzero; the expectation $E_P[-\log Q(B(X,D))]$ is finite for all $D > 0$; and a $\delta > 0$ exists for which

$$E_P\left[\sup_{0 < D < \delta} \left|\log\frac{P(B(X,D))}{Q(B(X,D))}\right|\right] < \infty. \qquad (21)$$

Then, the reverse repeated limit also holds:

$$\lim_{D \downarrow 0} \lim_{n \to \infty} \frac{1}{n}\log\frac{P_n(B(X_1^n,D))}{Q_n(B(X_1^n,D))} = H(\mathbb{P}\|\mathbb{Q}) \quad \text{w.p.1.}$$



It is easy to check that all conditions of the theorem hold when Q is a Gaussian measure on $\mathbb{R}$ and P has finite variance and a probability density function g (with respect to Lebesgue measure) such that $E_P\left(\sup_{|y - X| < \delta} |\log g(y)|\right) < \infty$ for some $\delta > 0$. For example, this is the case when both P and Q are Gaussian distributions on $\mathbb{R}$.

As will be seen from the proof of the theorem, although we are primarily interested in the case when the relative entropy rate $H(\mathbb{P}\|\mathbb{Q})$ is finite, the result remains true when $H(\mathbb{P}\|\mathbb{Q}) = \infty$, and in that case assumption (21) can be relaxed to

Ep 



, Q(B{X,D)) 



< oo. 



Finally we note that, in the context of ergodic theory, Feldman developed a different version
of the generalized AEP, and also discussed the relationship between the two types of asymptotics (as
$n \to \infty$, and as $D \downarrow 0$).



3 Applications of the Generalized AEP 

As outlined in the introduction, the generalized AEP can be applied to a number of problems in data 
compression and pattern matching. Following along the lines of the corresponding applications in 
the lossless case, below we present applications of the results of the previous section to: 1. Shannon's 
random coding schemes; 2. mismatched codebooks in lossy data compression; 3. waiting times between 
stationary processes (corresponding to idealized Lempel-Ziv coding); 4. practical lossy Lempel-Ziv 
coding for memoryless sources; and 5. weighted codebooks in rate-distortion theory. 
3.1 Shannon's Random Codes



Shannon's well-known construction of optimal codes for lossy data compression is based on the idea of generating a random codebook. We review here a slightly modified version of his construction [69] and describe how the performance of the resulting random code can be analyzed using the generalized AEP.

Given a sequence of probability distributions $Q_n$ on $\hat{A}^n$, $n \ge 1$, we generate a random codebook according to the measures $Q_n$ as an infinite sequence of i.i.d. random vectors

$$Y_1^n(i), \quad i = 1, 2, \ldots,$$

with each $Y_1^n(i)$ having distribution $Q_n$ on $\hat{A}^n$. Suppose that, for a fixed n, this codebook is available to both the encoder and decoder. Given a source string $X_1^n$ to be described with distortion D or less, the encoder looks for a D-close match of $X_1^n$ in the codebook $\{Y_1^n(i);\ i \ge 1\}$. Let $i_n$ be the position of the first such match

$$i_n = \inf\{i \ge 1 : \rho_n(X_1^n, Y_1^n(i)) \le D\}$$

with the convention that the infimum of the empty set equals $+\infty$. If a match is found, then the encoder describes to the decoder the position $i_n$ using Elias' code for the integers |2^. This takes no more than

$$\log_2 i_n + 2\log_2\log_2 i_n + \text{Const.} \quad \text{bits.} \qquad (22)$$

If no match is found (something that asymptotically will not happen, with probability one), then the encoder describes $X_1^n$ with distortion D or less using some other default scheme.
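
The codelength bound (22) can be checked on a concrete integer code. The sketch below assumes the Elias delta code, which is one of several codes due to Elias; the text does not say which variant is intended, and the function names are illustrative only. Since $\log_2\log_2 i$ is undefined at $i = 1$, the check uses the slightly shifted bound $\log_2 i + 2\log_2(\log_2 i + 1) + 1$.

```python
import math

def elias_delta(i: int) -> str:
    """Return the Elias delta codeword of a positive integer as a bit string."""
    if i < 1:
        raise ValueError("Elias codes are defined for positive integers only")
    n = i.bit_length() - 1               # n = floor(log2 i)
    l = (n + 1).bit_length() - 1         # l = floor(log2 (n + 1))
    head = "0" * l + format(n + 1, "b")  # Elias gamma code of n + 1
    tail = format(i, "b")[1:]            # the n low-order bits of i
    return head + tail

def elias_delta_length(i: int) -> int:
    """Codeword length in bits: floor(log2 i) + 2 floor(log2(floor(log2 i) + 1)) + 1."""
    n = i.bit_length() - 1
    l = (n + 1).bit_length() - 1
    return n + 2 * l + 1

for i in (1, 2, 10, 1000, 10**6):
    bound = math.log2(i) + 2 * math.log2(math.log2(i) + 1) + 1
    print(i, elias_delta_length(i), round(bound, 2))
```

The length is dominated by the $\log_2 i_n$ term, which is why the analysis that follows only needs the behavior of $\log_2 i_n$.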

Let $\ell_n(X_1^n)$ denote the overall description length of the algorithm just described. In view of (22), in order to understand its compression performance, that is, to understand the asymptotic behavior of $\ell_n(X_1^n)$, it suffices to understand the behavior of the quantity

$$\log_2 i_n, \quad \text{for large } n.$$

Suppose that the probability $Q_n(B(X_1^n, D))$ of finding a D-close match for $X_1^n$ in the codebook is nonzero. Then, conditional on the source string $X_1^n$, the distribution of $i_n$ is geometric with parameter $Q_n(B(X_1^n, D))$. From this observation it is easy to deduce that the behavior of $i_n$ is closely related to the behavior of the quantity $1/Q_n(B(X_1^n, D))$. The next theorem is an easy consequence of this fact so it is stated here without proof; see the corresponding arguments in [50][51].

Theorem 9. Strong Approximation: Let X be an arbitrary process and let $\{Q_n\}$ be a given sequence of codebook distributions. If $Q_n(B(X_1^n, D)) > 0$ eventually with probability one, then for any $\epsilon > 0$:

$$\log_2 i_n \le -\log_2 Q_n(B(X_1^n, D)) + \log_2\log_2 n + 3 \quad \text{eventually, w.p.1}$$

and

$$\log_2 i_n \ge -\log_2 Q_n(B(X_1^n, D)) - \log_2 n - (1 + \epsilon)\log_2\log_2 n \quad \text{eventually, w.p.1.}$$



The above estimates can now be combined with the results of the generalized AEP in the previous section to determine the performance of codes based on random codebooks with respect to the "optimal" measures $Q_n$. To illustrate this approach we consider the special case of memoryless sources and finite reproduction alphabets, and show that the random code with respect to (almost) any random codebook realization is asymptotically optimal, with probability one. Note that corresponding results can be proved, in exactly the same way, under much more general assumptions. For example, utilizing Theorem 5 instead of Theorem 1 we can prove the analog of Theorem 10 below for arbitrary stationary ergodic sources.

Let X be an i.i.d. source with marginal distribution $P_1 = P$ on A, and take the reproduction alphabet $\hat{A}$ to be finite. For simplicity we will assume that the distortion measure $\rho$ is bounded, i.e., $\sup_{x,y}\rho(x,y) < \infty$, and we also make the customary assumption that

$$\sup_x \min_y \rho(x,y) = 0. \qquad (23)$$

[See the remark at the end of Section 5.1.1 for a discussion of this condition and when it can be relaxed.] As usual, we define the rate-distortion function of the memoryless source X by

$$R(D) = \inf I(X;Y)$$

where the infimum is over all jointly distributed random variables $(X, Y)$ with values in $A \times \hat{A}$, such that X has distribution P and $E[\rho(X,Y)] \le D$. Let

$$\overline{D} = \min_y E_P[\rho(X,y)] \qquad (24)$$

and note that $R(D) = 0$ for $D \ge \overline{D}$. To avoid the trivial case when $R(D) = 0$ for all D, we assume that $\overline{D} > 0$ and we restrict our attention to the interesting range of values $D \in (0, \overline{D})$. Recall [31][50] that for any such D, $R(D)$ can alternatively be written as

$$R(D) = \inf_Q R_1(P,Q,D)$$

where the infimum is over all probability distributions Q on $\hat{A}$. Since we take $\hat{A}$ to be finite, this infimum is always achieved (see |5^) by a probability distribution $Q = Q^*$. To avoid cumbersome notation in the statements of the coding theorems given next and also in later parts of the paper, we also write $\mathcal{R}(D)$ for the rate-distortion function of the source X expressed in bits rather than in nats:

$$\mathcal{R}(D) = (\log_2 e) R(D).$$

Finally, we write $Q_n^*$ for the product measures $(Q^*)^n$ and call $\{Q_n^*\}$ the optimal reproduction distributions at distortion level D.
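
The identity $R(D) = \inf_Q R_1(P,Q,D)$ can be illustrated numerically. The sketch below is not taken from the paper's proofs: it assumes the standard variational representation $R_1(P,Q,D) = \sup_{\lambda \le 0}[\lambda D - \Lambda(\lambda)]$ with $\Lambda(\lambda) = E_P[\log E_Q e^{\lambda\rho(X,Y)}]$, valid for D between the minimal and average distortions, and evaluates it by a ternary search. The source, the distortion measure, and the function name `r1` are illustrative choices.

```python
import math

def r1(P, Q, rho, D, lo=-60.0, iters=200):
    """R_1(P, Q, D) in nats, as the sup over lam <= 0 of lam*D - Lambda(lam)."""
    def objective(lam):
        log_mgf = 0.0
        for x, px in enumerate(P):
            inner = sum(qy * math.exp(lam * rho[x][y]) for y, qy in enumerate(Q))
            log_mgf += px * math.log(inner)
        return lam * D - log_mgf
    a, b = lo, 0.0
    for _ in range(iters):            # ternary search: the objective is concave in lam
        m1 = a + (b - a) / 3
        m2 = b - (b - a) / 3
        if objective(m1) < objective(m2):
            a = m1
        else:
            b = m2
    return objective((a + b) / 2)

def h(p):
    """Binary entropy in nats."""
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

# Bernoulli(0.3) source with Hamming distortion at level D = 0.1.
P = [0.7, 0.3]
rho = [[0, 1], [1, 0]]
D = 0.1
RD = h(0.3) - h(0.1)                  # classical R(D) for this source
Q_star = [0.75, 0.25]                 # Q*(1) = (p - D) / (1 - 2D)
print("R(D)         =", round(RD, 6))
print("R_1(P,Q*,D)  =", round(r1(P, Q_star, rho, D), 6))
print("R_1(P,unif,D)=", round(r1(P, [0.5, 0.5], rho, D), 6))
```

With the optimal reproduction distribution the two quantities agree, while any other codebook distribution gives a strictly larger value.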

Combining Theorem 9 with the generalized AEP of Theorem 1 implies the following strengthened 
direct coding theorem. 



Theorem 10. Pointwise Coding Theorem for I.I.D. Sources [50]: Let X be an i.i.d. source with distribution P on A, and let $Q_n^*$ denote the optimal reproduction distributions at distortion level $D \in (0, \overline{D})$. Then the codes based on almost any realization of the Shannon random codebooks according to the measures $\{Q_n^*\}$ have codelengths $\ell_n(X_1^n)$ satisfying:

$$\lim_{n \to \infty} \frac{1}{n}\ell_n(X_1^n) = \mathcal{R}(D) \quad \text{bits per symbol, w.p.1.}$$

A simple modification of the above scheme can be used to obtain universal codebooks that achieve optimal compression for any memoryless source: Given a fixed block-length n, we consider the collection of all n-types on $\hat{A}$, namely, all distributions Q of the form $Q(a) = j/n$, $0 \le j \le n$, for $a \in \hat{A}$. Instead of generating a single random codebook according to the optimal distribution $Q_n^*$, we generate multiple codebooks, one for each product measure $Q^n$ corresponding to an n-type Q on $\hat{A}$. Then we (as the encoder) adopt a greedy coding strategy. We find the first D-close match for $X_1^n$ in each of the codebooks, and pick the one in which the match appears the earliest. To describe $X_1^n$ to the decoder with distortion D or less we then describe two things: (a) the index of the codebook in which the earliest match was found, and (b) the position $i_n$ of this earliest match. Since there are at most polynomially many n-types (cf. [^[^), the rate of the description of (a) is asymptotically negligible. Moreover, since the set of n-types is asymptotically dense among probability measures on $\hat{A}$, we eventually do as well as if we were using the optimum codebook distribution $Q_n^*$.
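
The claim that the codebook index costs a negligible rate can be made concrete with a short count. The sketch below uses the standard fact that the number of n-types on an alphabet of size k is $\binom{n+k-1}{k-1} \le (n+1)^k$; the helper names and the alphabet size are illustrative.

```python
import math

def num_types(n: int, k: int) -> int:
    """Number of n-types on an alphabet of size k: C(n + k - 1, k - 1)."""
    return math.comb(n + k - 1, k - 1)

def index_rate_bits(n: int, k: int) -> float:
    """Bits per source symbol spent naming one n-type with a fixed-length index."""
    return math.ceil(math.log2(num_types(n, k))) / n

k = 4  # size of the reproduction alphabet (illustrative)
for n in (10, 100, 1000, 10000):
    print(n, num_types(n, k), (n + 1) ** k, round(index_rate_bits(n, k), 5))
```

The per-symbol cost of the index decays like $(k \log_2 n)/n$, so it does not affect the limiting rate.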



Theorem 11. Pointwise Universal Coding Theorem [51]: Let X be an arbitrary i.i.d. source with distribution P on A, let $R(D)$ be the rate-distortion function of this source at distortion level $D \in (0, \overline{D})$, and let $\mathcal{R}(D)$ denote its rate-distortion function in bits. The codes based on almost any realization of the universal Shannon random codebooks have codelengths $\ell_n(X_1^n)$ satisfying:

$$\lim_{n \to \infty} \frac{1}{n}\ell_n(X_1^n) = \mathcal{R}(D) \quad \text{bits per symbol, w.p.1.}$$

3.2 Mismatched Codebooks 

In the last section we described how, for memoryless sources, the Shannon random codebooks with respect to the optimal reproduction distributions can be used to achieve asymptotically optimal compression performance. In this section we briefly consider the question of determining the rate achieved when an arbitrary (stationary ergodic) source X is encoded using a random codebook according to the i.i.d. distributions $Q^n$, for an arbitrary distribution Q on $\hat{A}$. For further discussion of the problem
of mismatched codebooks see [56|[37|[53|[^| and the references therein. 



The following theorem is an immediate consequence of combining Theorem 1 with Theorem 9 and 
the discussion in Section 3.1 (see also Example 1 in Section 2.2). 

Theorem 12. Mismatched Coding Rate: Let X be a stationary ergodic process with marginal distribution $P_1 = P$ on A, let Q be an arbitrary distribution on $\hat{A}$, and define $D_{\min}$ and $D_{\mathrm{av}}$ as in Section 2.2.

(a) Arbitrary I.I.D. Codebooks: For any distortion level $D \in (D_{\min}, D_{\mathrm{av}})$, the codes based on almost any realization of the Shannon random codebooks according to the measures $\{Q^n\}$ have codelengths $\ell_n(X_1^n)$ satisfying:

$$\lim_{n \to \infty} \frac{1}{n}\ell_n(X_1^n) = (\log_2 e) R_1(P,Q,D) \quad \text{bits per symbol, w.p.1.}$$

(b) I.I.D. Gaussian Codebooks: Suppose $\rho(x, y) = (x - y)^2$ and X is a real-valued process with finite variance $\sigma^2 = \mathrm{Var}(X_1)$. Let Q be the $N(0, \tau^2)$ distribution on $\mathbb{R}$. Then for any distortion level $D \in (0, \sigma^2 + \tau^2)$, the codes based on almost any realization of the Gaussian codebooks according to the measures $\{Q^n\}$ have codelengths $\ell_n(X_1^n)$ satisfying:

$$\lim_{n \to \infty} \frac{1}{n}\ell_n(X_1^n) = \frac{1}{2}\log_2\left(\frac{v}{D}\right) - (\log_2 e)\frac{(v - D)(v - \sigma^2)}{2 v \tau^2} \quad \text{bits per symbol, w.p.1,}$$

where

$$v \triangleq \frac{1}{2}\left[\tau^2 + \sqrt{\tau^4 + 4 D \sigma^2}\right].$$

Lossless vs. Lossy Mismatch: Recall that, in the case of lossless data compression, if instead of the true source distribution P a different coding distribution Q is used, then the code-rate achieved is

$$H(P) + H(P\|Q). \qquad (25)$$

Similarly in the current setting of lossy data compression, if instead of the optimal reproduction distribution $Q^*$ we use a different codebook distribution Q, the rate we achieve is $R_1(P,Q,D)$. An upper bound for $R_1(P,Q,D)$ is obtained by taking $(X,Y)$ in the expression of Remark 1 to be the jointly distributed random variables that achieve the infimum in the definition of the rate-distortion function of P. Then the (mismatched) rate of the random code based on Q instead of $Q^*$ is:

$$R_1(P,Q,D) \le R(D) + H(Q^*\|Q). \qquad (26)$$

Equations (25) and (26) illustrate the analogy between the penalty terms in the lossless and lossy case due to mismatch.

Next we discuss two special cases of part (b) of the theorem that are of particular interest. 

Example 2: Gaussian codebook with mismatched distribution: Consider the following coding scenario: We want to encode data generated by an i.i.d. Gaussian process with $N(0, \sigma^2)$ distribution, with squared-error distortion D or less. In this case, it is well-known [|[[20| that for any $D \in (0, \sigma^2)$ the optimal reproduction distribution $Q^*$ is the $N(0, \sigma^2 - D)$ distribution, so we construct random codebooks according to the i.i.d. distributions $Q_n^* = (Q^*)^n$.

But suppose that, instead of an i.i.d. Gaussian, the source turns out to be some arbitrary stationary ergodic X with zero mean and variance $\sigma^2$. Theorem 12 (b) implies that the asymptotic rate achieved by our i.i.d. Gaussian codebook is equal to

$$\frac{1}{2}\log_2\left(\frac{\sigma^2}{D}\right) \quad \text{bits per symbol.}$$

Since this is exactly the rate-distortion function of the i.i.d. $N(0, \sigma^2)$ source, we conclude that the rate achieved is the same as what we would have obtained on the Gaussian source we originally expected. This offers yet another justification of the folk theorem that the Gaussian source is the hardest one to compress, among sources with a fixed variance. In fact, the above result is a natural fixed-distortion analog of |^3|, Theorem 3] .

Example 3: Gaussian codebook with mismatched variance: Here we consider a different type of mismatch. As before, we are prepared to encode an i.i.d. Gaussian source, but we have an incorrect estimate of its variance, say $\hat{\sigma}^2$ instead of the true variance $\sigma^2$. So we are using a random codebook with respect to the optimal reproduction distribution $Q_n^* = (Q^*)^n$, where $Q^*$ is the $N(0, \hat{\sigma}^2 - D)$ distribution, but the actual source is i.i.d. $N(0, \sigma^2)$. In this case, the rate achieved by the random codebooks according to the distributions $Q_n^*$ is given by the expression in Theorem 12 (b), with $\tau^2$ replaced by $\hat{\sigma}^2 - D$. Although the resulting expression is somewhat long and not easy to manipulate analytically, it is straightforward to evaluate numerically. For example, Figure 1 shows the asymptotic rate achieved, as a function of the error $e = \hat{\sigma}^2 - \sigma^2$ in the estimate of the true variance. As expected, the best rate is achieved when the codebook distribution is matched to the source (corresponding to $e = 0$), and it is equal to the rate-distortion function of the source. Moreover, as one might expect, it is more harmful to underestimate the variance than to overestimate it.
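
The numerical evaluation mentioned in Example 3 takes only a few lines. The sketch below assumes the closed form of the Gaussian mismatched rate obtained by maximizing $\lambda D - \Lambda(\lambda)$ for a zero-mean source of variance $\sigma^2$ and an $N(0,\tau^2)$ codebook, namely $\frac{1}{2}\log_2(v/D) - (\log_2 e)(v-D)(v-\sigma^2)/(2v\tau^2)$ with $v = \frac{1}{2}[\tau^2 + \sqrt{\tau^4 + 4D\sigma^2}]$, and it assumes the error convention $e = \hat{\sigma}^2 - \sigma^2$. The function name and the grid of error values are illustrative.

```python
import math

def gaussian_codebook_rate(sigma2: float, tau2: float, D: float) -> float:
    """Asymptotic rate (bits/symbol) of an i.i.d. N(0, tau2) codebook on a
    zero-mean source of variance sigma2, squared-error distortion level D."""
    if not 0 < D < sigma2 + tau2:
        raise ValueError("need 0 < D < sigma2 + tau2")
    v = 0.5 * (tau2 + math.sqrt(tau2 ** 2 + 4 * D * sigma2))
    return (0.5 * math.log2(v / D)
            - math.log2(math.e) * (v - D) * (v - sigma2) / (2 * v * tau2))

sigma2, D = 2.0, 1.0                  # true variance and distortion level
for e in (-0.5, -0.25, 0.0, 0.25, 0.5):
    tau2 = (sigma2 + e) - D           # codebook variance built from the estimate
    rate = gaussian_codebook_rate(sigma2, tau2, D)
    print(f"e = {e:+.2f}   rate = {rate:.4f} bits/symbol")
```

At $e = 0$ the second term vanishes because $v = \sigma^2$, and the rate reduces to $\frac{1}{2}\log_2(\sigma^2/D)$; negative errors (underestimates) cost more than positive errors of the same size.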
[Figure 1 plot; horizontal axis: error e in the variance]

Figure 1: This graph shows the rate achieved by an i.i.d. Gaussian codebook of variance $\hat{\sigma}^2 - D$ when applied to i.i.d. $N(0, \sigma^2)$ data. The rate is shown as a function of the error $e = \hat{\sigma}^2 - \sigma^2$ in the variance estimate. In this particular example: $\sigma^2 = 2$, $D = 1$, the error e ranges from $-1/2$ to $1/2$, and the rate-distortion function of the source equals 0.5 bits/symbol.

3.3 Waiting Times and Idealized Lempel-Ziv Coding 

Given $D \ge 0$ and two independent realizations from the stationary ergodic processes X and Y, our main quantity of interest here is the waiting time $W_n = W_n(D)$ until a D-close version of the initial string $X_1^n$ first appears in $Y_1^\infty$. Formally

$$W_n = \inf\{i \ge 1 : \rho_n(X_1^n, Y_i^{i+n-1}) \le D\} \qquad (27)$$

with the convention, as before, that the infimum of the empty set equals $+\infty$.
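
Definition (27) translates directly into a search over a finite database. The sketch below assumes binary strings, average per-symbol Hamming distortion for $\rho_n$, and a finite realization of Y, so that "no match" is reported as `None` instead of $+\infty$; all names are illustrative.

```python
import random

def rho_n(x, y):
    """Average per-symbol Hamming distortion between two equal-length strings."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def waiting_time(x, y, D):
    """First 1-based position i with rho_n(x, y[i-1 : i-1+n]) <= D, or None."""
    n = len(x)
    for i in range(1, len(y) - n + 2):
        if rho_n(x, y[i - 1:i - 1 + n]) <= D:
            return i
    return None

random.seed(0)
y = [random.randint(0, 1) for _ in range(200_000)]    # database realization
for n in (4, 8, 12, 16):
    x = [random.randint(0, 1) for _ in range(n)]      # independent template
    print(n, waiting_time(x, y, 0.0), waiting_time(x, y, 0.25))
```

Allowing distortion can only shorten the wait, since every exact match is also a D-close match.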

The motivation for studying the asymptotic behavior of Wn for large n is twofold. 

Idealized Lempel-Ziv coding. The natural extension of the idealized scenario described in the introduction is to consider a message $X_1^n$ that is to be encoded with the help of a database $Y_1^\infty$. The source and the database are assumed to be independent, and the database distribution may or may not be the same as that of the source. In order to communicate $X_1^n$ to the decoder with distortion D or less, the encoder simply describes $W_n$, using no more than

$$\log_2 W_n + O(\log_2\log_2 W_n) \quad \text{bits.}$$

Therefore, the asymptotic performance of this idealized scheme can be completely understood in terms of the asymptotics of $\log W_n$, for large n.

DNA pattern matching. Here we imagine that $X_1^n$ represents a DNA or protein "template," and we want to see whether it appears, either exactly or approximately, as a contiguous substring of a database DNA sequence $Y_1^\infty$. We are interested in quantifying the "degree of surprise" in the fact that a D-close match was found at position $W_n$. Specifically, was the match found "atypically" early, or is the value of $W_n$ consistent with the hypothesis that the template and the database are independent? For a detailed discussion, see, e.g., m. Section 3.2]||l|l|§ and the references therein.
If for a moment we consider the case when both X and Y are i.i.d., we see that the waiting time $W_n$ is, at least intuitively, closely related to the index $i_n$ of Section 3.1. As the following result shows, although the distribution of $W_n$ is not exactly geometric, $W_n$ behaves very much like $i_n$, at least in the exponent. That is, the difference

$$\log W_n - [-\log Q_n(B(X_1^n, D))]$$

is "small," eventually with probability one.

Recall the definition of $\psi$-mixing from Section 2.3, and also the definition of the $\phi$-mixing coefficients of Y

$$\phi(k) = \sup\{|\mathbb{Q}(B|A) - \mathbb{Q}(B)| : B \in \sigma(Y_k^\infty),\ A \in \sigma(Y_{-\infty}^0),\ \mathbb{Q}(A) > 0\}$$

where, as before, $\sigma(Y_i^j)$ denotes the $\sigma$-field generated by $Y_i^j$. The process Y is called $\phi$-mixing if $\phi(k) \to 0$ as $k \to \infty$; see Q for an extensive discussion of $\phi$-mixing and related mixing conditions.


Theorem 13. Strong Approximation j^6]f^^: Let X and Y be stationary ergodic processes, and assume that Y is either $\psi$-mixing or $\phi$-mixing with summable $\phi$-mixing coefficients, $\sum_{k \ge 1} \phi(k) < \infty$. If $Q_n(B(X_1^n, D)) > 0$ eventually with probability one, then for any $\epsilon > 0$:

$$-(1 + \epsilon)\log n \le \log[W_n Q_n(B(X_1^n, D))] \le (2 + \epsilon)\log n \quad \text{eventually, w.p.1.}$$

Theorem 13 of course implies that

$$\log W_n = -\log Q_n(B(X_1^n, D)) + O(\log n) \quad \text{w.p.1} \qquad (28)$$

and combining this with the generalized AEP statements of Theorems 1 and 4 we immediately obtain the first order (or strong-law-of-large-numbers, SLLN) asymptotic behavior of the waiting times $W_n$.

Theorem 14. SLLN for Waiting Times: Let X and Y be stationary ergodic processes.

(a) If Y is i.i.d. and the average distortion $D_{\mathrm{av}}$ is finite, then for any $D \in (D_{\min}, D_{\mathrm{av}})$

$$\frac{1}{n}\log W_n \to R_1(P_1, Q_1, D) \quad \text{w.p.1.} \qquad (29)$$

(b) If Y is $\psi$-mixing and the distortion measure $\rho$ is bounded, then for any $D \in (D_{\min}, D_{\mathrm{av}})$

$$\frac{1}{n}\log W_n \to R(\mathbb{P}, \mathbb{Q}, D) \quad \text{w.p.1.} \qquad (30)$$

Note that similar results can be obtained under different assumptions on the process Y, using 
Theorems 3 and 5 in place of Theorems 1 and 4 as done above. When X is taken to be an arbitrary 
stationary ergodic process, it is natural to expect that the mixing conditions for Y in Theorem 14 (b) 
cannot be substantially relaxed. In fact, even in the case of exact matching between finite-alphabet 
processes, Shields ||7^ has produced a counterexample demonstrating that the analog of Theorem 13
does not hold for arbitrary stationary ergodic Y. 

Historical Remarks: Waiting times in the context of lossy data compression were studied by Steinberg and Gutman [71] and Luczak and Szpankowski [55]. Yang and Kieffer [79] identified the limiting rate-function for a wide range of finite alphabet sources, and Dembo and Kontoyiannis [23] and Chi [16] generalized these results to processes with general alphabets.

The strong approximation idea was introduced in |^] in the case of exact matching. For processes Y with summable $\phi$-mixing coefficients, Theorem 13 was proved in p^ , and when Y is $\psi$-mixing it was proved, for the case of no distortion, in [46]. Examining the latter proof, [16] observed that it immediately generalizes to the statement of Theorem 13.

Related results were obtained by Kanaya and Muramatsu [38], who extended some of the results of to processes with general alphabets, and by Koga and Arimoto Q who considered non-overlapping waiting times between finite-alphabet processes and Gaussian processes. Finally, Shields [70] and Marton and Shields [56] considered waiting times with respect to Hamming distortion and for X and Y having the same distribution over a finite alphabet. For the case of small distortion they showed, under some conditions, that approximate matching results like (29) and (30) reduce to their natural exact matching analogs as $D \to 0$.

3.4 Match-Lengths and Practical Lempel-Ziv Coding 

In the idealized coding scenario of the previous section we considered the case where a fixed-length message $X_1^n$ is to be compressed using an infinitely long database $Y_1^\infty$. But, in practice, the reverse situation is much more common: We typically have a "long" message $(X_1, X_2, \ldots)$ to be compressed, and only a finite-length database is available to the encoder and decoder. It is therefore natural (following the corresponding development in the case of lossless compression) to try and match "as much as possible" from the message $(X_1, X_2, \ldots)$ into the database $Y_1^m$. With this in mind we define the match-length as the length $\ell$ of the longest prefix $X_1^\ell$ that matches somewhere in the database with distortion D or less:

$$L_m = \sup\{\ell \ge 1 : \rho_\ell(X_1^\ell, Y_j^{j+\ell-1}) \le D, \text{ for some } j = 1, 2, \ldots, m\}. \qquad (31)$$

Intuitively, there is a connection between match-lengths and waiting times. Long matches should mean short waiting times, and vice versa. In the case of exact matching this connection was precisely formalized by Wyner and Ziv [^] , who observed that the following "duality" relationship always holds:

$$W_n \le m \iff L_m \ge n. \qquad (32)$$

This is almost identical to the standard relationship in renewal theory between the number of events by a certain time and the time of the nth event (see, e.g., |^). Wyner and Ziv |£5[ utilized (32) to translate their first order asymptotic results about $W_n$ to corresponding results about $L_m$.

Unfortunately this simple relationship no longer holds in the case of approximate matching, when a distortion measure is introduced. Instead, the following modified duality was employed in [23] to obtain corresponding results in approximate matching and lossy data compression:

$$W_n \le m \ \Rightarrow\ L_m \ge n \quad \text{and} \quad L_m \ge n \ \Rightarrow\ \inf_{k \ge n} W_k \le m. \qquad (33)$$
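
The two implications relating $W_n$ and $L_m$ can be checked by brute force on short random strings. The sketch below assumes binary strings, average per-symbol Hamming distortion, and finite strings, so that the prefix lengths are capped at the message length and waiting times with no match are reported as infinity; the function names are illustrative. It also counts how often the exact-matching converse "$L_m \ge n \Rightarrow W_n \le m$" fails.

```python
import random

def rho(x, y):
    """Average per-symbol Hamming distortion between two equal-length strings."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def waiting_time(x, y, n, D):
    """W_n: first 1-based i with rho(x[:n], y[i-1:i-1+n]) <= D (inf if none)."""
    for i in range(1, len(y) - n + 2):
        if rho(x[:n], y[i - 1:i - 1 + n]) <= D:
            return i
    return float("inf")

def match_length(x, y, m, D):
    """L_m: longest l with x[:l] within D of y[j-1:j-1+l] for some j <= m (0 if none)."""
    best = 0
    for l in range(1, len(x) + 1):
        for j in range(1, min(m, len(y) - l + 1) + 1):
            if rho(x[:l], y[j - 1:j - 1 + l]) <= D:
                best = l
                break
    return best

def check_duality(x, y, D):
    """Return (both implications hold, number of failures of the exact converse)."""
    N = len(x)
    W = {n: waiting_time(x, y, n, D) for n in range(1, N + 1)}
    ok, converse_failures = True, 0
    for m in range(1, len(y) + 1):
        L = match_length(x, y, m, D)
        for n in range(1, N + 1):
            if W[n] <= m and L < n:
                ok = False
            if L >= n:
                if min(W[k] for k in range(n, N + 1)) > m:
                    ok = False
                if W[n] > m:
                    converse_failures += 1
    return ok, converse_failures

random.seed(3)
x = [random.randint(0, 1) for _ in range(8)]
y = [random.randint(0, 1) for _ in range(30)]
for D in (0.0, 0.25, 0.5):
    print(D, check_duality(x, y, D))
```

With $D = 0$ the converse never fails, because an exact match of a long prefix contains an exact match of every shorter prefix at the same position; with $D > 0$ a long prefix can match on average while a shorter prefix does not.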



In [23] it is shown that (33) can be used to deduce the asymptotic behavior of $L_m$ from that of $W_n$, but this translation is not straightforward anymore. In fact, as we discuss in Section 5.2, a somewhat more delicate analysis is needed in this case. Nevertheless, once the behavior of the waiting times is understood, the first implication in (33) immediately yields asymptotic lower bounds on the behavior of the match-lengths. This is significant for data compression since long match-lengths usually mean good compression performance. Indeed, this observation allowed ||4^ to introduce a new lossy version of the Lempel-Ziv algorithm that achieves asymptotically optimal compression performance for memoryless sources. The key characteristics of the algorithm are that it has polynomial implementation complexity, and that it achieves redundancy comparable to that of its lossless counterpart, the FDLZ [[T^.
We also mention that, before this, several practical (yet suboptimal) lossy versions of the Lempel-Ziv algorithm were introduced, perhaps most notably by Steinberg and Gutman [71] and Luczak and Szpankowski [55]. Roughly speaking, the reason for their suboptimal compression performance was that the coding was done with respect to a database that had the same distribution as the source. In view of the discussion in the previous section, it is clear that the asymptotic code-rate of these algorithms is $R_1(P, P, D)$, which is typically significantly larger than the optimal $R(D) = \inf_Q R_1(P,Q,D)$; see or for more detailed discussions.



3.5 Weighted Codebooks and Sphere-Covering 

Here we describe a related question that was recently considered in ||4^. In the classical rate-distortion problem, one is interested in finding "efficient" codebooks for describing the output of some random source to within some tolerable distortion level. In terms of data compression, a codebook is "efficient" when it contains relatively few codewords, so that it yields a code with a low rate. Here we are interested in the more general problem of finding codebooks with small "mass."

Let X be an i.i.d. process with marginal distribution P on a finite alphabet A, and take $\hat{A} = A$ and $\rho$ a distortion measure with the property that $\rho(x, y) = 0$ if and only if $x = y$. Let $M : \hat{A} \to (0, \infty)$ be an arbitrary nonnegative function assigning mass $M^n(C_n)$ to subsets $C_n$ of $\hat{A}^n$:

$$M^n(C_n) = \sum_{y_1^n \in C_n} M^n(y_1^n) = \sum_{y_1^n \in C_n} \prod_{i=1}^n M(y_i).$$

The question of interest here can be stated as follows. Let $C_n$ be a subset of $\hat{A}^n$ (we think of $C_n$ as the codebook) that nearly D-covers all of $A^n$, i.e., with high probability, every string $X_1^n$ generated by the source will match at least one element of $C_n$ with distortion D or less:

$$P^n\{\text{there is a } y_1^n \in C_n \text{ such that } \rho_n(X_1^n, y_1^n) \le D\} \approx 1. \qquad (34)$$

If (34) holds, how small can the mass of $C_n$ be?

For example, taking M identically equal to one, this problem reduces to the rate-distortion question. Taking M to be a different probability measure Q, it reduces to the classical hypothesis testing question, whereas $M = P$ (the source distribution) yields "converses" to some measure-concentration inequalities; see [48] for a detailed treatment of these and more general cases.

The next result characterizes the best growth exponent for the mass of an arbitrary codebook C„. 

Theorem 15: Weighted Codebooks [48]: Let X be an i.i.d. source on the finite alphabet $A = \hat{A}$, and suppose that $\rho(x, y) = 0$ if and only if $x = y$.

(a) Let $C_n$ be an arbitrary subset of $\hat{A}^n$, and write D for the expected distance of a source string $X_1^n$ from $C_n$:

$$D = E_{P^n}\left[\min_{y_1^n \in C_n} \rho_n(X_1^n, y_1^n)\right].$$

Then

$$M^n(C_n) \ge e^{n r(D)}$$

where the rate-function $r(D) = r(D; P, M)$ is defined by

$$r(D) = r(D; P, M) = \inf\{I(X;Y) + E[\log M(Y)]\}$$

and the infimum is taken over all jointly distributed random variables $(X, Y)$ with values in A, such that $X \sim P$ and $E[\rho(X,Y)] \le D$.

(b) For every $D \ge 0$ there is a sequence of codebooks $\{C_n^*\}$ such that

$$\limsup_{n \to \infty} \frac{1}{n}\log M^n(C_n^*) \le r(D)$$

and

$$\limsup_{n \to \infty} E_{P^n}\left[\min_{y_1^n \in C_n^*} \rho_n(X_1^n, y_1^n)\right] \le D.$$
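
The lower bound on the mass of a codebook can be checked by enumeration in the simplest special case. The sketch below assumes $M \equiv 1$, so that $M^n(C_n) = |C_n|$ and $r(D)$ reduces to the rate-distortion function, and it assumes a Bernoulli(p) source with Hamming distortion, for which $R(D) = h(p) - h(D)$ when $D < \min(p, 1-p)$ and zero otherwise. The block length and the names are illustrative.

```python
import itertools
import math
import random

def h(p):
    """Binary entropy in nats."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

def rate_distortion(p, D):
    """R(D) in nats for a Bernoulli(p) source under Hamming distortion."""
    return h(p) - h(D) if D < min(p, 1 - p) else 0.0

def expected_distance(codebook, n, p):
    """D = E_{P^n}[ min over y in the codebook of rho_n(X_1^n, y) ], by enumeration."""
    total = 0.0
    for x in itertools.product((0, 1), repeat=n):
        prob = p ** sum(x) * (1 - p) ** (n - sum(x))
        dist = min(sum(a != b for a, b in zip(x, y)) for y in codebook) / n
        total += prob * dist
    return total

n, p = 6, 0.3
rng = random.Random(0)
all_strings = list(itertools.product((0, 1), repeat=n))
for size in (1, 2, 4, 8, 16):
    C = rng.sample(all_strings, size)
    D = expected_distance(C, n, p)
    bound = math.exp(n * rate_distortion(p, D))
    print(size, round(D, 4), round(bound, 3), size >= bound)
```

Every codebook satisfies the bound, and for randomly chosen codebooks of this size the bound is typically far from tight, which is consistent with the role of part (b) as an existence statement for good codebooks.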

The main ingredient in the proof of the direct coding theorem in part (b) above is provided by yet another version of the generalized AEP. Let $(X^*, Y^*)$ be a pair of random variables achieving the infimum in the definition of $r(D)$, and let $Q^*$ be the distribution of $Y^*$. Now for $\delta > 0$ and $n \ge 1$ define the sets

$$\mathcal{Q}_n = \{y_1^n \in \hat{A}^n : \hat{P}_{y_1^n}(b) \le Q^*(b) + \delta,\ \forall b \in \hat{A}\}$$

where $\hat{P}_{y_1^n}$ denotes the empirical distribution induced by $y_1^n$ on $\hat{A}$. For each $n \ge 1$ define the "conditioned" measure $Q_n^{(c)}$ on $\hat{A}^n$ by conditioning the product measure $(Q^*)^n$ to the set $\mathcal{Q}_n$. The next theorem provides the necessary version of the generalized AEP in this case.

Theorem 16: Generalized AEP for Conditioned Measures [48]: With the conditioned measures $Q_n^{(c)}$ defined as above, we have:

$$\limsup_{n \to \infty} -\frac{1}{n}\log Q_n^{(c)}(B(X_1^n, D)) \le I(X^*; Y^*) \quad \text{w.p.1.}$$

4 Refinements of the Generalized AEP 

As we saw in Section 3, the generalized AEP can be used to determine the first order asymptotic 
behavior of a number of interesting objects arising in applications. For example, the generalized AEP 
of Theorem 1

\[ -\frac{1}{n} \log Q^n(B(X_1^n, D)) \to R_1(P,Q,D) \quad \text{w.p.1} \]

immediately translated (via the strong approximation of Theorem 13) to a strong-law-of-large-numbers (SLLN) result for the waiting times:

\[ \frac{1}{n} \log W_n \to R_1(P,Q,D) \quad \text{w.p.1.} \]

In this section we will prove refinements to the generalized AEP of Section 2.2, and in Section 5 we 
will revisit the applications of the previous section and use these refinements to prove corresponding 
second order asymptotic results. 

To get some motivation, let us consider for a moment the simplest version of the classical AEP, for an i.i.d. process X with distribution P on the finite alphabet A. The AEP here follows by a simple application of the law of large numbers,

\[ -\frac{1}{n} \log P^n(X_1^n) = \frac{1}{n} \sum_{i=1}^n [-\log P(X_i)] \to H \tag{35} \]

where H is the entropy of P. But (35) contains more information than that: It says that $-\log P^n(X_1^n)$ is in fact equal to the partial sum $S_n = \sum_{i=1}^n Z_i$ of the i.i.d. random variables $Z_i = -\log P(X_i)$. Therefore we can apply the central limit theorem (CLT) or the law of the iterated logarithm (LIL) to get more precise information on the convergence of the AEP.
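As an illustrative aside (not part of the original development), the following Python sketch checks this numerically for a small i.i.d. source. It draws strings $X_1^n$, forms $(-\log P^n(X_1^n) - nH)/\sqrt{n}$, and lets one compare the empirical variance with $\mathrm{Var}[-\log P(X_1)]$, the variance predicted by the CLT. The distribution `P`, the sample sizes, and all function names are assumptions made for the example.

```python
import math
import random

def entropy(P):
    """Entropy H of P in nats."""
    return -sum(p * math.log(p) for p in P.values() if p > 0)

def varentropy(P):
    """Var[-log P(X_1)], the variance of the summands Z_i."""
    H = entropy(P)
    return sum(p * (-math.log(p) - H) ** 2 for p in P.values() if p > 0)

def neg_log_prob(xs, P):
    """-log P^n(x_1^n), written as the partial sum of Z_i = -log P(x_i)."""
    return sum(-math.log(P[x]) for x in xs)

def normalized_deviations(P, n, trials, seed=0):
    """Samples of (-log P^n(X_1^n) - n H) / sqrt(n) for i.i.d. X_i ~ P."""
    rng = random.Random(seed)
    symbols = list(P)
    weights = [P[s] for s in symbols]
    H = entropy(P)
    out = []
    for _ in range(trials):
        xs = rng.choices(symbols, weights=weights, k=n)
        out.append((neg_log_prob(xs, P) - n * H) / math.sqrt(n))
    return out

if __name__ == "__main__":
    P = {"a": 0.5, "b": 0.25, "c": 0.25}   # hypothetical source distribution
    devs = normalized_deviations(P, n=200, trials=2000, seed=1)
    mean = sum(devs) / len(devs)
    var = sum((d - mean) ** 2 for d in devs) / (len(devs) - 1)
    print("empirical variance:", var, "predicted:", varentropy(P))
```

For a uniform `P` every $Z_i$ equals $H$, so the deviations vanish identically; this is the degenerate case that reappears later as zero "minimal coding variance".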
The same strategy can be carried out for non-i.i.d. processes: Initially Ibragimov [?] and then Philipp and Stout [?] showed that even when X is a Markov chain, or, more generally, a weakly dependent random process, the quantities $-\log P^n(X_1^n)$ can be approximated by the partial sums of an associated weakly dependent process. These results have found a number of applications in lossless data compression and related areas [?], [?].

In this and the following section we will carry out a similar program in the lossy case. Throughout this section we will adopt the notation and assumptions of Section 2.2: Let X be a stationary ergodic process with first order marginal $P_1 = P$ on $A$, and let $Q$ be an arbitrary probability measure on $\hat{A}$. Define $D_{\min}$ and $D_{\rm av}$ as before (see the corresponding definitions in Section 2.2), and assume that $D_{\min} < D_{\rm av}$ so that the distortion measure $\rho(X,Y)$ is not essentially constant in Y with positive probability. We also impose here the additional assumption that $\rho$ has a finite third moment:

\[ D_3 = E_{P\times Q}[\rho^3(X,Y)] < \infty. \tag{36} \]

The first result of this section refines Theorem 1 by giving a more precise asymptotic estimate of the quantity $-\log Q^n(B(X_1^n, D))$ in terms of the rate-function $R_1(P,Q,D)$ and the empirical measure $\hat{P}_n$ induced by $X_1^n$ on $A$

\[ \hat{P}_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i} \]

where $\delta_x$ denotes the measure assigning unit mass to $x \in A$.



Theorem 17 [?]: Let X be a stationary ergodic process with marginal P on $A$, and let Q be an arbitrary probability measure on $\hat{A}$. Assume that $D_3 = E_{P\times Q}[\rho^3(X,Y)]$ is finite. Then for any $D \in (D_{\min}, D_{\rm av})$:

\[ -\log Q^n(B(X_1^n, D)) = n R_1(\hat{P}_n, Q, D) + \frac{1}{2} \log n + O(1) \quad \text{w.p.1.} \tag{37} \]



Next we show that the most significant term in (37) can be approximated by the partial sum of a weakly dependent random process. Recall the definition of the $\alpha$-mixing coefficients of X

\[ \alpha(k) = \sup\{ |\mathbb{P}(A \cap B) - \mathbb{P}(A)\mathbb{P}(B)| : A \in \sigma(X_{-\infty}^0), \ B \in \sigma(X_k^\infty) \} \]

where $\sigma(X_i^j)$ is the $\sigma$-field generated by $X_i^j$. The process X is called $\alpha$-mixing if $\alpha(k) \to 0$ as $k \to \infty$; see [?] for more details.

We also need to recall some of the notation from the proof of Theorem 1 in Section 2.2. For $x \in A$ and $\lambda \in \mathbb{R}$, let $\Lambda_x(\lambda)$ denote the log-moment generating function of the random variable $\rho(x,Y)$

\[ \Lambda_x(\lambda) = \log E_Q\big(e^{\lambda \rho(x,Y)}\big) \]

and note that the function $\Lambda(\lambda)$ defined in Section 2.2 can be written as $\Lambda(\lambda) = E_P[\Lambda_X(\lambda)]$. Also recall that for any $D \in (D_{\min}, D_{\rm av})$ there exists a unique $\lambda^* < 0$ such that $\Lambda'(\lambda^*) = D$.
Theorem 18 [?]: Let X be a stationary $\alpha$-mixing process with marginal P on $A$, and let Q be an arbitrary probability measure on $\hat{A}$. Assume that the $\alpha$-mixing coefficients of X satisfy

\[ \sum_{k=1}^{\infty} \alpha^t(k) < \infty, \quad \text{for some } t \in (0, 1/3) \tag{38} \]

and that $D_3 = E_{P\times Q}[\rho^3(X,Y)]$ is finite. Then for any $D \in (D_{\min}, D_{\rm av})$:

\[ n R_1(\hat{P}_n, Q, D) = n R_1(P,Q,D) + \sum_{i=1}^n g(X_i) + O(\log\log n) \quad \text{w.p.1} \]

where

\[ g(x) = \Lambda(\lambda^*) - \Lambda_x(\lambda^*), \quad x \in A. \tag{39} \]



Theorem 18 is a small generalization of [23, Theorem 3]. Before giving its proof outline, we combine Theorems 17 and 18 to show that, as promised, $-\log Q^n(B(X_1^n, D))$ can be accurately approximated as the partial sum of the weakly dependent random process $\{g(X_n)\}$.
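A minimal numerical sketch of the quantities entering Theorem 18, for finite alphabets. It assumes the Legendre-transform characterization $R_1(P,Q,D) = \lambda^* D - \Lambda(\lambda^*)$ from Section 2.2 (not restated in this section), locates $\lambda^*$ by bisection on the increasing function $\Lambda'$, and returns $g(x)$ as in (39). The function names, the bisection bracket `lo=-60`, and the example distributions are hypothetical choices, and $D$ is assumed to lie in $(D_{\min}, D_{\rm av})$.

```python
import math

def log_mgf_x(x, lam, Q, rho):
    """Lambda_x(lam) = log E_Q[exp(lam * rho(x, Y))]."""
    return math.log(sum(q * math.exp(lam * rho(x, y)) for y, q in Q.items()))

def d_log_mgf_x(x, lam, Q, rho):
    """Lambda_x'(lam): mean of rho(x, Y) under the exponentially tilted Q."""
    w = {y: q * math.exp(lam * rho(x, y)) for y, q in Q.items()}
    z = sum(w.values())
    return sum(wy * rho(x, y) for y, wy in w.items()) / z

def d_log_mgf(lam, P, Q, rho):
    """Lambda'(lam) = E_P[Lambda_X'(lam)]."""
    return sum(p * d_log_mgf_x(x, lam, Q, rho) for x, p in P.items())

def solve_lambda_star(P, Q, rho, D, lo=-60.0, tol=1e-12):
    """Bisection for lam* < 0 with Lambda'(lam*) = D (assumes a root in [lo, 0])."""
    hi = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if d_log_mgf(mid, P, Q, rho) < D:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def rate_and_g(P, Q, rho, D):
    """Return (lam*, R_1(P,Q,D), g) with g(x) = Lambda(lam*) - Lambda_x(lam*)."""
    lam = solve_lambda_star(P, Q, rho, D)
    Lam = sum(p * log_mgf_x(x, lam, Q, rho) for x, p in P.items())
    R1 = lam * D - Lam          # assumed Legendre-transform form of R_1
    g = {x: Lam - log_mgf_x(x, lam, Q, rho) for x in P}
    return lam, R1, g

if __name__ == "__main__":
    P = {0: 0.5, 1: 0.3, 2: 0.2}            # hypothetical source marginal
    Q = {0: 0.4, 1: 0.4, 2: 0.2}            # hypothetical reproduction measure
    rho = lambda x, y: abs(x - y)
    lam, R1, g = rate_and_g(P, Q, rho, D=0.3)
    print(lam, R1, g)
```

By construction $E_P[g(X_1)] = 0$, so the sum $\sum_i g(X_i)$ in Theorem 18 is a centered fluctuation term around $nR_1(P,Q,D)$.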

Corollary 19: Second Order Generalized AEP: Let X be a stationary $\alpha$-mixing process with marginal P on $A$, and let Q be an arbitrary probability measure on $\hat{A}$. Assume that the $\alpha$-mixing coefficients of X satisfy (38) and that $D_3 = E_{P\times Q}[\rho^3(X,Y)]$ is finite. Then for any $D \in (D_{\min}, D_{\rm av})$, and with $g(x)$ defined as in (39):

\[ -\log Q^n(B(X_1^n, D)) = n R_1(P,Q,D) + \sum_{i=1}^n g(X_i) + \frac{1}{2} \log n + O(\log\log n) \quad \text{w.p.1.} \]

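Proof Outline for Theorem 18: Adapting the argument leading from (22) to (24) of [23], one easily checks that the result of Theorem 18 holds as soon as

\[ \liminf_{n\to\infty} \inf_{|\theta|<\delta} B_n(\theta) > 0 \quad \text{w.p.1} \tag{40} \]

and

\[ \limsup_{n\to\infty} \frac{n A_n^2}{\log\log n} < \infty \quad \text{w.p.1} \tag{41} \]

where $A_n = \frac{1}{n}\sum_{k=1}^n \zeta_k$ is the empirical mean of the centered random variables $\zeta_k = \Lambda'_{X_k}(\lambda^*) - D$, and $B_n(\theta)$ is the empirical mean of the nonnegative random variables $\Lambda''_{X_k}(\lambda^* + \theta)$. By the ergodic theorem we have, with probability one,

\[ \liminf_{n\to\infty} \inf_{|\theta|<\delta} B_n(\theta) \ \ge\ \liminf_{n\to\infty} \frac{1}{n} \sum_{k=1}^n \inf_{|\theta|<\delta} \Lambda''_{X_k}(\lambda^* + \theta) \ =\ E_P\Big[ \inf_{|\theta|<\delta} \Lambda''_X(\lambda^* + \theta) \Big] \]

and by Fatou's lemma and the continuity of the map $\theta \mapsto \Lambda''_x(\lambda^* + \theta)$ it follows that

\[ \liminf_{\delta \downarrow 0} E_P\Big[ \inf_{|\theta|<\delta} \Lambda''_X(\lambda^* + \theta) \Big] \ \ge\ E_P[\Lambda''_X(\lambda^*)] = \Lambda''(\lambda^*) > 0. \]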
This implies that (40) holds once $\delta > 0$ is made small enough. [Note that the above argument also avoids an incorrect, but also unnecessary, application of the uniform ergodic theorem in the derivation of [23, eq. (26)].]

Turning to (41), since $\lambda^* < 0$, it follows by the convexity of $\Lambda_x(\lambda)$ that for any $x \in A$:

\[ 0 \le \Lambda'_x(\lambda^*) \le \Lambda'_x(0) = E_Q[\rho(x,Y)]. \]

Consequently, Hölder's inequality and assumption (36) imply that the random variable $\zeta_k$, which satisfies

\[ |\zeta_k| \le E_Q[\rho(X_k,Y) \,|\, X_k] + D, \]

has a finite third moment. Recall that the LIL holds for the partial sums $nA_n$ of a zero-mean, stationary process $\{\zeta_k\}$ with a finite third moment, as soon as the $\alpha$-mixing coefficients of $\{\zeta_k\}$ satisfy (38). The observation that $\zeta_k$ is a deterministic function of $X_k$ for all k completes the proof. □



5 Applications — Second Order Results 

Here we revisit the applications considered in Section 3, and using the "second order generalized 
AEP" of Corollary 19 we prove second order refinements for many of the results from Section 3. In 
Section 5.1 we consider the problem of lossy data compression in the same setting as in Section 3.1. 
We use the second order AEP to determine the precise asymptotic behavior of the Shannon random 
codebooks, and show that, with probability one, they achieve optimal compression performance up to 
terms of order (log n) bits. Moreover, essentially the same compression performance can be achieved
universally. For arbitrary variable-length codes operating at a fixed rate level, we show that the rate
at which they can achieve the optimal rate of $nR(D)$ bits is at best of order $O(\sqrt{n})$ bits. This is
the best possible redundancy rate as long as the "minimal coding variance" of the source is strictly 
positive. For discrete i.i.d. sources, a characterization is given of when this variance can be zero. 

In Section 5.2 we look at waiting times, and we prove a second order refinement to Theorem 14, 
and in Section 5.3 we consider the problem of determining the asymptotic behavior of longest match- 
lengths. As discussed briefly in Section 3.4, their asymptotics can be deduced from the corresponding 
waiting-times results via duality. 

5.1 Lossy Data Compression 

5.1.1 Random Codes and Second Order Converses 

Here we consider the exact same setup as in Section 3.1: An i.i.d. source X with distribution P on $A$ is to be compressed with distortion D or less with respect to a bounded distortion measure $\rho$, satisfying, as before, the usual assumption (23); see the remark at the end of this section for its implications. We take the reproduction alphabet $\hat{A}$ to be finite, define $\bar{D}$ as before, and assume that $\bar{D} > 0$.

For $D \in (0, \bar{D})$, let $Q_n^*$, $n \ge 1$, denote the optimal reproduction distributions at distortion level D. Combining the strong approximation Theorem 9 with the second order generalized AEP of Corollary 19 and the discussion in Section 3.1 yields:
Theorem 20: Pointwise Redundancy for I.I.D. Sources [?]: Suppose X is an i.i.d. source with distribution P on $A$, and with rate-distortion function $R(D)$ (in bits). Let $Q_n^*$ denote the optimal reproduction distributions at distortion level $D \in (0, \bar{D})$, and define the function $h(x) = (\log_2 e)\, g(x)$, $x \in A$, with g defined as in (39). Then:

(a) The codes based on almost any realization of the Shannon random codebooks according to the measures $\{Q_n^*\}$ have codelengths $\ell_n(X_1^n)$ satisfying

\[ \ell_n(X_1^n) \le n R(D) + \sum_{i=1}^n h(X_i) + 4 \log n \ \text{ bits, eventually, w.p.1.} \]

(b) The codes based on almost any realization of the universal Shannon random codebooks have codelengths $\ell_n(X_1^n)$ satisfying

\[ \ell_n(X_1^n) \le n R(D) + \sum_{i=1}^n h(X_i) + (4 + |\hat{A}|) \log n \ \text{ bits, eventually, w.p.1.} \]

We remark that the coefficients of the (log n) terms in (a) and (b) above are not the best possible, and can be significantly improved; see [?] for more details.

Perhaps somewhat surprisingly, it turns out that the performance of the above random codes is optimal up to terms of order (log n) bits. Recall that a code $C_n$ operating at distortion level $D \ge 0$ is defined by a triplet $(B_n, \phi_n, \psi_n)$ where:

(a) $B_n$ is a subset of $\hat{A}^n$ called the codebook,

(b) $\phi_n : A^n \to B_n$ is the encoder,

(c) $\psi_n : B_n \to \{0,1\}^*$ is a uniquely decodable map,

such that

\[ \rho_n(x_1^n, \phi_n(x_1^n)) \le D, \quad \text{for all } x_1^n \in A^n. \]

The codelengths $\ell_n(X_1^n)$ achieved by such a code are simply:

\[ \ell_n(x_1^n) = \text{length of } [\psi_n(\phi_n(x_1^n))] \ \text{ bits.} \]
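To make the triplet concrete, here is a toy Python sketch for a binary alphabet with Hamming distortion. It is only an illustration under stated assumptions, not the construction analyzed in the paper: the codebook is drawn i.i.d. Bernoulli(`q`), the encoder picks the first codeword within distortion `D`, and, so that the distortion constraint holds for every input, it falls back to reproducing the input exactly when no codeword is close enough. The map `psi` uses a flag bit followed by an Elias gamma code for the codeword index; all names (`make_code`, `elias_gamma`, `phi`, `psi`) are hypothetical.

```python
import random

def elias_gamma(k):
    """Prefix-free binary description of an integer k >= 1."""
    b = bin(k)[2:]
    return "0" * (len(b) - 1) + b

def rho_n(x, y):
    """Normalized Hamming distortion between two equal-length strings."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def make_code(n, D, q, size, seed=0):
    """Return (phi, psi): encoder and uniquely decodable map of a toy code."""
    rng = random.Random(seed)
    words = [tuple(int(rng.random() < q) for _ in range(n)) for _ in range(size)]
    index = {}
    for i, y in enumerate(words, start=1):
        index.setdefault(y, i)            # keep the first occurrence of each word

    def phi(x):
        for y in words:                   # first codeword within distortion D
            if rho_n(x, y) <= D:
                return y
        return tuple(x)                   # fallback: reproduce x itself

    def psi(b):
        if b in index:
            return "1" + elias_gamma(index[b])
        return "0" + "".join(str(s) for s in b)   # fallback words sent uncoded

    return phi, psi

if __name__ == "__main__":
    phi, psi = make_code(n=8, D=0.25, q=0.5, size=40, seed=1)
    x = (1, 0, 1, 1, 0, 0, 1, 0)
    y = phi(x)
    print(y, psi(y), "codelength:", len(psi(y)), "bits")
```

The effective codebook $B_n$ here is the set of random words together with the fallback words, and the codelength of an input is `len(psi(phi(x)))`.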

Theorem 21: Pointwise Converse for I.I.D. Sources [?]: Let X be an i.i.d. source with distribution P on $A$, and let $\{C_n\}$ be an arbitrary sequence of codes operating at distortion level $D \in (0, \bar{D})$, with associated codelengths $\ell_n$. Then:

\[ \ell_n(X_1^n) \ge n R(D) + \sum_{i=1}^n h(X_i) - \log n \ \text{ bits, eventually, w.p.1} \]

where $h(x)$ is defined as in Theorem 20.

The proof of Theorem 21 in [?] uses techniques quite different from those developed in this paper. In particular, the key step in the proof is established by an application of the generalized Kuhn-Tucker conditions of Bell and Cover [5].

Theorems 20 and 21 are next combined to yield "second order" refinements to Shannon's classical source coding theorem. For a source X as in Theorem 21 and a $D \in (0, \bar{D})$, the minimal coding variance $\sigma^2 = \sigma^2(P, D)$ of source P at distortion level D is

\[ \sigma^2 = \sigma^2(P, D) = \mathrm{Var}[h(X_1)] \tag{42} \]

with $h(x)$ as in Theorem 20.
Theorem 22: Second Order Source Coding Theorems [?]: Let X be an i.i.d. source with distribution P on $A$ and with rate-distortion function $R(D)$ (in bits). For $D \in (0, \bar{D})$:

(CLT) There is a sequence of random variables $G_n = G_n(P, D)$ such that, for any sequence of codes $\{C_n, \ell_n\}$ operating at distortion level D, we have

\[ \ell_n(X_1^n) - n R(D) \ge \sqrt{n}\, G_n \ \text{ bits, eventually, w.p.1} \tag{43} \]

and the $G_n$ converge in distribution to a Gaussian random variable

\[ G_n \xrightarrow{\ \mathcal{D}\ } N(0, \sigma^2) \]

where $\sigma^2 = \sigma^2(P, D)$ is the minimal coding variance.

(LIL) With $\sigma^2$ as above, for any sequence of codes $\{C_n, \ell_n\}$ operating at distortion level D:

\[ \limsup_{n\to\infty} \frac{\ell_n(X_1^n) - n R(D)}{\sqrt{2 n \log\log n}} \ge \sigma \quad \text{w.p.1} \]

\[ \liminf_{n\to\infty} \frac{\ell_n(X_1^n) - n R(D)}{\sqrt{2 n \log\log n}} \ge -\sigma \quad \text{w.p.1.} \]

($\Rightarrow$) Moreover, there exist codes $\{C_n, \ell_n\}$ operating at distortion level D, that asymptotically achieve equality universally in all these lower bounds.



Remark on Assumption (23): When the distortion measure does not satisfy assumption (23) [as, for example, when $\rho(x,y) = (x - y)^2$ with $A = \mathbb{R}$ and $\hat{A}$ a finite subset of $\mathbb{R}$], we can modify $\rho$ to $\rho'(x,y) = \rho(x,y) - f(x)$, with $f(x) = \min_{y \in \hat{A}} \rho(x,y)$, so that $\rho'$ satisfies (23). Then, to generate codes operating at distortion level D with respect to $\rho$, we can construct random codebooks as before but do the encoding with respect to $\rho'(x,y)$ at the random distortion level $D_n = D - E_{\hat{P}_n}(f(X))$. It is
not hard to check that Theorem 2] can be extended to apply when D is replaced by the sequence 
$\{D_n\}$. Since $D_n \to D - E_P(f(X))$ as $n \to \infty$, this results with the first order approximation

\[ -\frac{1}{n} \log Q_n^*(B(X_1^n, D_n)) \approx R_1'(\hat{P}_n, Q^*, D_n). \]

Simple algebra then shows that

\[ R_1'(\hat{P}_n, Q^*, D_n) = R_1(\hat{P}_n, Q^*, D) \]

implying that all the results of Section 5.1.1 remain valid [despite the fact that $\rho$ does not satisfy (23)], with the function $h(\cdot)$ taken in terms of the log-moment generating function of the original distortion measure $\rho$ (and not that of the modified $\rho'$).
5.1.2 Critical Behavior 

In view of Theorems 20 and 21 above, the codelengths $\ell_n^*(X_1^n)$ of the best code operating at distortion level D have:

\[ \ell_n^*(X_1^n) \approx n R(D) + \sum_{i=1}^n h(X_i) + O(\log n) \ \text{ bits.} \]

This reveals an interesting dichotomy in the behavior of the "pointwise" redundancy of the best code: 
• Either the minimal coding variance $\sigma^2$ (recall (42)) is nonzero, in which case the best rate at which optimality can be achieved is of order $\sqrt{n}$ bits by the CLT;

• or $\sigma^2 = 0$, and the best redundancy rate is of order (log n) bits (cf. [?]).

Under certain conditions, in this section we give a precise characterization of when each of these two 
cases can occur. Before stating it, we briefly discuss two examples to gain some intuition. 

Example 4: Lossless Compression: Lossless data compression can be considered as an extreme case of lossy compression, where X is an i.i.d. source with distribution P on a finite set $A = \hat{A}$, and the distortion level D is set to zero. Here it is well-known that (ignoring the integer length constraints) the best code is given by the idealized Shannon code, $\ell_n(X_1^n) = -\log_2 P^n(X_1^n)$. Accordingly, the upper and lower bounds of Theorems 20 and 21 say that the best code has codelengths

\[ \ell_n^*(X_1^n) \approx n H(P) + \sum_{i=1}^n h(X_i) \ \text{ bits} \]

where $H(P)$ is the entropy of P in bits, and with

\[ h(x) = -\log_2 P(x) - H(P), \quad x \in A. \]

When is $\sigma^2 = 0$? By its definition (42), $\sigma^2$ is zero if and only if the function $h(x)$ is constant over x, which, in this case, can only happen if $P(x)$ is constant over $x \in A$. Therefore, here: $\sigma^2 = 0$ if and only if the source has a uniform distribution over $A$.

Example 5: Binary Source with Hamming Distortion: Consider the simplest non-trivial lossy example: Let X be an i.i.d. source with Bernoulli(p) distribution (for some $p \in (0, 1/2]$), let $A = \hat{A} = \{0, 1\}$, and take $\rho$ to be Hamming distortion: $\rho(x,y) = |x - y|$. For $D \in (0, p)$ it is not hard to evaluate all the relevant quantities explicitly (see, e.g., [?, Example 2.7.1] or [?, Theorem 13.3.1]). In particular, the optimal reproduction distribution $Q^*$ is Bernoulli(q), with $q = (p - D)/(1 - 2D)$, and our function of interest is:

\[ h(x) = -\log_2 \frac{P(x)}{1 - D} - E_P\Big[ -\log_2 \frac{P(X)}{1 - D} \Big]. \]

Recalling that the minimal coding variance is zero if and only if $h(x)$ is constant, from the above expression we see that, similarly to the previous example, also here: $\sigma^2 = 0$ if and only if the source has a uniform distribution.
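The claim in Example 5 can be checked numerically from the definitions, as in the following hedged Python sketch. It evaluates $h(x) = (\log_2 e)\,[\Lambda(\lambda^*) - \Lambda_x(\lambda^*)]$ for a Bernoulli(p) source with Hamming distortion, using $Q^* = $ Bernoulli(q) from the example and the closed form $\lambda^* = \log(D/(1-D))$, which is an assumption of this sketch (the returned value `dL` lets one confirm that $\Lambda'(\lambda^*) = D$). The function name `hamming_example` is hypothetical.

```python
import math

def hamming_example(p, D):
    """Return (q, h, var, dL) for a Bernoulli(p) source, Hamming distortion, 0 < D < p <= 1/2."""
    q = (p - D) / (1 - 2 * D)            # Q* = Bernoulli(q), as in Example 5
    P = {0: 1 - p, 1: p}
    Q = {0: 1 - q, 1: q}
    lam = math.log(D / (1 - D))          # assumed closed form for lambda*
    t = math.exp(lam)
    # Lambda_x(lam) = log E_Q[exp(lam * |x - Y|)] = log(Q(x) + (1 - Q(x)) e^lam)
    Lx = {x: math.log(Q[x] + (1 - Q[x]) * t) for x in P}
    L = sum(P[x] * Lx[x] for x in P)
    h = {x: (L - Lx[x]) / math.log(2) for x in P}      # bits
    var = sum(P[x] * h[x] ** 2 for x in P)             # E_P[h] = 0, so this is Var[h(X_1)]
    # consistency check: Lambda'(lam) should equal D
    dL = sum(P[x] * (1 - Q[x]) * t / (Q[x] + (1 - Q[x]) * t) for x in P)
    return q, h, var, dL

if __name__ == "__main__":
    for p in (0.3, 0.5):
        q, h, var, dL = hamming_example(p, D=0.1)
        print("p =", p, "q =", q, "h =", h, "variance =", var, "Lambda' =", dL)
```

In this example $h(x)$ reduces to $-\log_2 P(x) - H(P)$, so the variance vanishes only for $p = 1/2$.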

For discrete sources, the next result gives conditions under which the characterization suggested by these two examples remains valid. Suppose $A = \hat{A} = \{a_1, a_2, \ldots, a_k\}$ is a finite set, write $\rho_{ij}$ for $\rho(a_i, a_j)$, and assume that $\rho$ is symmetric and that $\rho_{ij} = 0$ if and only if $i = j$. We call $\rho$ a permutation distortion measure, if all rows of the matrix $(\rho_{ij})_{i,j=1,\ldots,k}$ are permutations of one another.
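The definition is easy to test mechanically. A small Python sketch, with the hypothetical name `is_permutation_distortion`, that checks the two standing assumptions (symmetry, and zero entries exactly on the diagonal) together with the row-permutation property for a distortion matrix given as a list of rows:

```python
def is_permutation_distortion(rho):
    """True if the square matrix rho is symmetric, vanishes exactly on the
    diagonal, and has rows that are permutations of one another."""
    k = len(rho)
    symmetric = all(rho[i][j] == rho[j][i] for i in range(k) for j in range(k))
    zero_iff_diag = all((rho[i][j] == 0) == (i == j)
                        for i in range(k) for j in range(k))
    same_rows = all(sorted(row) == sorted(rho[0]) for row in rho)
    return symmetric and zero_iff_diag and same_rows

if __name__ == "__main__":
    hamming = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
    absolute = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]     # |i - j| on three letters
    print(is_permutation_distortion(hamming), is_permutation_distortion(absolute))
```

Hamming distortion passes the check, while $|i - j|$ on three letters does not, since its middle row is not a rearrangement of the first.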

Theorem 23: Variance Characterization [?]: Let X be a discrete source with distribution P and rate-distortion function $R(D)$. Assume that $R(D)$ is strictly convex over $(0, \bar{D})$. There are exactly two possibilities:

(a) Either $\sigma^2 = \sigma^2(P, D)$ is only zero for finitely many $D \in (0, \bar{D})$.

(b) Or $\sigma^2 = \sigma^2(P, D) = 0$ for all $D \in (0, \bar{D})$, in which case P is the uniform distribution on $A$ and $\rho$ is a permutation distortion measure.
A general discussion of this problem, including the case of continuous sources, is given in [24]. Also, in the lossless case, the problem of characterizing when $\sigma^2 = 0$ for sources with memory is dealt with in [45].

Before moving on to waiting times and match-lengths we mention that, in a somewhat similar vein, the problem of understanding the best expected redundancy rate in lossy data compression has also been recently considered in [?], [?], [80], [37].



5.2 Waiting Times 

Next we turn to waiting times. Recall that, given $D \ge 0$ and two independent realizations of the stationary ergodic processes X and Y, the waiting time $W_n$ was defined as the time of the first appearance of $X_1^n$ in Y with distortion D or less (see (27) for the precise definition). In Theorem 14 we gave conditions that identified the first order limiting behavior of $W_n$. In particular, when Y is i.i.d., it was shown in Theorem 14 (a) that

\[ \frac{1}{n} \log W_n \to R_1(P,Q,D) \quad \text{w.p.1} \tag{44} \]

where P and Q are the first order marginals of X and Y, respectively.

The next result gives conditions under which the SLLN-type statement of (44) can be refined to a CLT and a LIL.

Theorem 24: CLT and LIL for Waiting Times: Let X be a stationary $\alpha$-mixing process and Y be an i.i.d. process, with marginal distributions P and Q, on $A$ and $\hat{A}$, respectively. Assume that the $\alpha$-mixing coefficients of X satisfy (38) and that $D_3 = E_{P\times Q}[\rho^3(X,Y)]$ is finite. Then for any $D \in (D_{\min}, D_{\rm av})$ the following series converges

\[ \sigma^2 = E_P[g^2(X_1)] + 2 \sum_{k=2}^{\infty} E_P[g(X_1) g(X_k)] \tag{45} \]

with $g(x)$ defined as in (39), and, moreover:

(CLT) With $R_1 = R_1(P,Q,D)$:

\[ \frac{\log W_n - n R_1}{\sqrt{n}} \xrightarrow{\ \mathcal{D}\ } N(0, \sigma^2). \]

(LIL) The set of limit points of the sequence

\[ \left\{ \frac{\log W_n - n R_1}{\sqrt{2 n \log\log n}} \right\}, \quad n \ge 3 \]

coincides with $[-\sigma, \sigma]$, with probability one.

Proof Outline: For a bounded distortion measure $\rho$, Theorem 24 was proved in [23]. To obtain the more general statement above combine the strong approximation of Theorem 13 with the second order AEP in Corollary 19 to get:

\[ \log W_n = n R_1(P,Q,D) + \sum_{i=1}^n g(X_i) + O(\log n) \quad \text{w.p.1.} \tag{46} \]

Since X satisfies the mixing assumption (38), so does the process $\{g(X_n)\}$. Also, since $\lambda^* < 0$, the function $\Lambda_x(\lambda^*)$ is bounded above by zero, and by Jensen's inequality it is bounded below by $\lambda^* E_Q[\rho(x,Y)]$. Therefore,

\[ |\Lambda_x(\lambda^*)| \le |\lambda^*| \, E_Q[\rho(x,Y)] \]

and this, together with Hölder's inequality and the definition of $g(x)$, imply that $E_P[|g(X_1)|^3] < \infty$. Therefore we can apply the CLT of [?, Theorem 1.7] to the process $\{g(X_n)\}$ in order to deduce the CLT-part of the theorem from (46). Similarly, applying the LIL of [?] to $\{g(X_n)\}$, from (46) we get the LIL-part of the theorem. □

Remark 5: When the variance $\sigma^2$ in (45) is positive, then the functional versions of the above CLT and LIL given in [?] still hold, under exactly the conditions of Theorem 24. (This follows by applying the functional CLT of [?, Theorem 1.7] and the functional LIL of [60, Theorem 1 (IV)].)

5.3 Match-Lengths and Duality

Finally we turn to our last application, match-lengths. Recall that, given a distortion level $D \ge 0$ and two independent realizations of the processes X and Y, the match-length $L_m$ is defined as the length $\ell$ of the longest prefix $X_1^\ell$ that appears (with distortion D or less) starting somewhere in the "database" $Y_1^m$. See (31) for the precise definition. As we briefly mentioned in Section 3.4, there is a duality relationship between match-lengths and waiting times: Roughly speaking, long matches mean short waiting times, and vice versa; see (33).
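The duality is easiest to see in the exact-matching case $D = 0$, where $W_n \le m$ holds if and only if $L_m \ge n$. The Python sketch below computes both quantities on finite strings and lets one verify this equivalence. The precise definitions (27) and (31) are not restated in this section, so the forms used here are assumptions: $W_n$ is the first position $i \ge 1$ with $\rho_n(X_1^n, Y_i^{i+n-1}) \le D$, and $L_m$ is the largest $\ell$ such that $\rho_\ell(X_1^\ell, Y_j^{j+\ell-1}) \le D$ for some starting position $1 \le j \le m$. Function names are hypothetical.

```python
import random

def rho_n(x, y):
    """Normalized Hamming distortion between two equal-length strings."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def waiting_time(x, y, D):
    """First 1-based position i such that x matches y[i..i+n-1] within distortion D."""
    n = len(x)
    for i in range(len(y) - n + 1):
        if rho_n(x, y[i:i + n]) <= D:
            return i + 1
    return None                      # no match inside the finite string y

def match_length(x, y, m, D):
    """Longest prefix of x matching, within distortion D, a window of y that
    starts at one of the first m positions (0 if there is none)."""
    best = 0
    for j in range(min(m, len(y))):
        max_len = min(len(x), len(y) - j)
        # with D > 0 matches need not be nested, so every longer length is tried
        for ell in range(best + 1, max_len + 1):
            if rho_n(x[:ell], y[j:j + ell]) <= D:
                best = ell
    return best

if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.randint(0, 1) for _ in range(10)]
    y = [rng.randint(0, 1) for _ in range(80)]
    for n in (2, 4, 6):
        print(n, waiting_time(x[:n], y, 0), match_length(x, y, 40, 0))
```

For $D > 0$ a long prefix can match without its shorter prefixes matching, which is why the relation between $W_n$ and $L_m$ is less direct in the lossy case.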

Although the relation (33) is not as simple as the duality (32) for exact matching, it is still possible to use (33) to translate the asymptotic results for $W_n$ to corresponding results for $L_m$. These are given in Theorem 25 below. This translation, carried out in [23], is more delicate than in the case of exact matching. For example, in order to prove the CLT for the match-lengths $L_m$ one invokes the functional CLT for the waiting times (see Remark 5 above and the proof of Theorem 4 in [23]).



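Theorem 25: Match-Lengths Asymptotics: Let X be a stationary process and Y be an i.i.d. process, with marginal distributions P and Q, on $A$ and $\hat{A}$, respectively. Assume that $D_3 = E_{P\times Q}[\rho^3(X,Y)]$ is finite. Then for any $D \in (D_{\min}, D_{\rm av})$ we have

\[ \text{(LLN)} \qquad \frac{L_m}{\log m} \to \frac{1}{R_1} \quad \text{w.p.1} \]

where $R_1 = R_1(P,Q,D)$. If, moreover, the $\alpha$-mixing coefficients of X satisfy (38) and the variance $\sigma^2$ in (45) is nonzero, then, with $\tau^2 = \sigma^2 R_1^{-3}$, we have,

\[ \text{(CLT)} \qquad \frac{L_m - \frac{\log m}{R_1}}{\sqrt{\log m}} \xrightarrow{\ \mathcal{D}\ } N(0, \tau^2) \]

\[ \text{(LIL)} \qquad \limsup_{m\to\infty} \frac{L_m - \frac{\log m}{R_1}}{\sqrt{2 \log m \, \log\log\log m}} = \tau \quad \text{w.p.1.} \]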



The results of Theorem 25 were proved in [23] for any bounded distortion measure $\rho$. The slightly more general version stated above is proved in exactly the same way, using the results of Section 4 in place of Theorems 2 and 3 of [23].
6 Random Fields — First Order Results



This and the following section are devoted to generalizations of the results of Sections 2-5 to the case 
of random fields. Specifically, the role of the processes X and Y will now be played by stationary 
ergodic random fields $X = \{X_u \,;\, u \in \mathbb{Z}^d\}$ and $Y = \{Y_u \,;\, u \in \mathbb{Z}^d\}$. As we will see, many of the
problems that we considered have natural analogs in this case, and the overall theme carries over: The 
generalized AEP and its refinement can be extended to random fields, and the corresponding questions 
in data compression and pattern matching can be answered following the same path as before. 



6.1 Notation and Definitions 

The following definitions and notation will remain in effect throughout Sections 6 and 7. 

We consider two random fields $X = \{X_u \,;\, u \in \mathbb{Z}^d\}$ and $Y = \{Y_u \,;\, u \in \mathbb{Z}^d\}$, $d \ge 2$, taking values in $A$ and $\hat{A}$, respectively, and indexed by points $u = (u_1, u_2, \ldots, u_d)$ on the integer lattice $\mathbb{Z}^d$. As before, $A$ and $\hat{A}$ are complete, separable metric spaces, equipped with their Borel $\sigma$-fields $\mathcal{A}$ and $\hat{\mathcal{A}}$, respectively. Let $\mathbb{P}$ and $\mathbb{Q}$ denote the (infinite-dimensional) measures of the entire random fields X and Y. Unless explicitly stated otherwise, we always assume that X and Y are independent of each other.

Throughout the rest of the paper we will assume that X and Y are stationary and ergodic. To be precise, by that we mean that the Abelian group of translations $\{T_u : u \in \mathbb{Z}^d\}$ acts on both $(A^{\mathbb{Z}^d}, \mathcal{A}^{\mathbb{Z}^d}, \mathbb{P})$ and $(\hat{A}^{\mathbb{Z}^d}, \hat{\mathcal{A}}^{\mathbb{Z}^d}, \mathbb{Q})$ in a measure-preserving, ergodic manner; see [?] for a detailed exposition.

For $v, w \in \mathbb{Z}^d$, the distance between v and w is defined by

\[ d(v,w) = \max_{1 \le i \le d} |v_i - w_i| \]

and the distance between two subsets $V, W \subset \mathbb{Z}^d$ is

\[ d(V,W) = \inf_{v \in V,\, w \in W} d(v,w). \]

Given $v, w \in \mathbb{Z}^d$, we let $[v,w] = \{u \in \mathbb{Z}^d : v_j \le u_j \le w_j \text{ for all } j\}$, where $[v,w]$ is empty in case $v_j > w_j$ for some j.

We write $C(n)$ for the d-dimensional cube of side $n \ge 1$,

\[ C(n) = \{u \in \mathbb{Z}^d : 1 \le u_j \le n \text{ for all } j\} \]

and $[0, \infty)$ for the "infinite cube"

\[ [0, \infty) = \{u \in \mathbb{Z}^d : u_j \ge 0 \text{ for all } j\}. \]

For an arbitrary subset $U \subset \mathbb{Z}^d$ we let $|U|$ denote its size; for example, $|C(n)| = n^d$. Also for $U \subset \mathbb{Z}^d$ we write

\[ X_U = \{X_u \,;\, u \in U\} \]

so that, in particular, $X_{[0,\infty)} = \{X_u \,;\, u_j \ge 0 \text{ for all } j\}$. For $V \subset \mathbb{Z}^d$ and $u \in \mathbb{Z}^d$ we let $u + V$ denote the translate

\[ u + V = \{u + v : v \in V\}. \]
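For readers who prefer to experiment, the lattice notation translates directly into a few lines of Python. This is only a sketch for finite index sets, with hypothetical function names; lattice points are represented as tuples of integers.

```python
from itertools import product

def sup_dist(v, w):
    """d(v, w) = max_i |v_i - w_i|."""
    return max(abs(a - b) for a, b in zip(v, w))

def set_dist(V, W):
    """d(V, W) = inf over v in V, w in W of d(v, w), for finite non-empty sets."""
    return min(sup_dist(v, w) for v in V for w in W)

def box(v, w):
    """[v, w] = {u : v_j <= u_j <= w_j for all j}; empty if some v_j > w_j."""
    return set(product(*(range(a, b + 1) for a, b in zip(v, w))))

def cube(n, d):
    """C(n) = {u in Z^d : 1 <= u_j <= n for all j}."""
    return box((1,) * d, (n,) * d)

def translate(u, V):
    """u + V = {u + v : v in V}."""
    return {tuple(a + b for a, b in zip(u, v)) for v in V}

if __name__ == "__main__":
    C = cube(3, 2)
    print(len(C))                                   # |C(n)| = n^d
    print(set_dist(cube(2, 2), translate((4, 0), cube(2, 2))))
```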
For each $n \ge 1$, let $P_n$ denote the marginal distribution of $X_{C(n)}$ on $A^{n^d}$, and similarly write $Q_n$ for the distribution of $Y_{C(n)}$. Let $\rho : A \times \hat{A} \to [0, \infty)$ be an arbitrary nonnegative (measurable) function, and define a sequence of single-letter distortion measures $\rho_n$:

\[ \rho_n(x_{C(n)}, y_{C(n)}) = \frac{1}{n^d} \sum_{u \in C(n)} \rho(x_u, y_u), \quad x_{C(n)} \in A^{n^d}, \ y_{C(n)} \in \hat{A}^{n^d}. \]

Given $D \ge 0$ and $x_{C(n)} \in A^{n^d}$, we write $B(x_{C(n)}, D)$ for the distortion-ball of radius D:

\[ B(x_{C(n)}, D) = \{ y_{C(n)} \in \hat{A}^{n^d} : \rho_n(x_{C(n)}, y_{C(n)}) \le D \}. \]

6.2 Generalized AEP 

It is well-known that the classical AEP

\[ -\frac{1}{n} \log P_n(X_1^n) \to H(\mathbb{P}) \quad \text{w.p.1} \]

generalizes to the case of finite-alphabet random fields on $\mathbb{Z}^d$, as well as to other amenable group actions [?]. In this section we extend the two versions of the generalized AEP given in Theorems 1 and 4 to the case of random fields on $\mathbb{Z}^d$.



Y is i.i.d. In the notation of Section 6.1, we take X to be a stationary ergodic random field with first order marginal $P_1 = P$, and Y to be i.i.d. with first order marginal $Q_1 = Q$. We define $D_{\min}$ and $D_{\rm av}$ as in the one-dimensional case (see Section 2.2), and assume that $\rho(x,y)$ is not essentially constant for (P-almost) all $x \in A$, that is, $D_{\min} < D_{\rm av}$.

A simple examination of the proof of Theorem 1 shows that it extends verbatim to the case of random fields, with the only difference that instead of the usual ergodic theorem we now need to invoke the ergodic theorem for $\mathbb{Z}^d$ actions; see [52, Chapter 6]. We thus obtain:



Theorem 26. Generalized AEP when Y is i.i.d.: Let X be a stationary ergodic random field on $\mathbb{Z}^d$ and Y be i.i.d., with marginal distributions P and Q on $A$ and $\hat{A}$, respectively. Assume that $D_{\rm av} = E_{P\times Q}[\rho(X,Y)]$ is finite. Then for any $D \in (D_{\min}, D_{\rm av})$

\[ -\frac{1}{n^d} \log Q^{n^d}(B(X_{C(n)}, D)) \to R_1(P,Q,D) \quad \text{w.p.1} \]

with the (one-dimensional) rate-function $R_1(P,Q,D)$ defined as in Theorem 1.



Y is not i.i.d. Let X and Y be stationary random fields and define $D_{\rm av}$ and $D_{\max}$ exactly as in the one-dimensional case (recall (8) and the accompanying definition of $D_{\max}$). We assume that the distortion measure $\rho$ is essentially bounded, $D_{\max} < \infty$, and define

\[ D_{\min} = \sup_{n \ge 1} D_{\min}^{(n)} = \lim_{n\to\infty} D_{\min}^{(n)} \tag{47} \]

where

\[ D_{\min}^{(n)} = E_{P_n}\Big[ \operatorname*{ess\,inf}_{Y_{C(n)} \sim Q_n} \rho_n(X_{C(n)}, Y_{C(n)}) \Big]. \tag{48} \]
To see that the limit in (47) exists and equals the supremum, first note that $\{n^d D_{\min}^{(n)}\}$ is an increasing sequence, and that $D_{\min}^{(nk)} \ge D_{\min}^{(k)}$ for all $n, k \ge 1$. Now fix $k \ge 1$ arbitrary. Given $n \ge k$ we write $n = mk + r$ for some $0 \le r \le k - 1$, so that

\[ n^d D_{\min}^{(n)} \ge (mk)^d D_{\min}^{(mk)} \ge (mk)^d D_{\min}^{(k)}. \]

Since $n/mk \to 1$ as $n \to \infty$, this implies that

\[ \liminf_{n\to\infty} D_{\min}^{(n)} \ge D_{\min}^{(k)}. \]

Since k was arbitrary, $\liminf_{n} D_{\min}^{(n)} \ge \sup_k D_{\min}^{(k)} \ge \limsup_n D_{\min}^{(n)}$, and we are done.

Finally, we assume once again that the distortion measure $\rho$ is not essentially constant, that is, $D_{\min} < D_{\rm av}$. Our next result is the random fields analog of Theorem 4; it is proved in Appendix C.

Theorem 27. Generalized AEP rate function. Let X and Y be stationary random fields. Assume that $\rho$ is bounded, and that with $\mathbb{P}$-probability one, conditional on $X_{[0,\infty)} = x_{[0,\infty)}$, the random variables $\{\rho_n(x_{C(n)}, Y_{C(n)})\}$ satisfy a large deviations principle with some deterministic, convex rate-function. Then for all $D \in (D_{\min}, D_{\rm av})$, except possibly at $D = D_{\min}^{(\infty)}$,

\[ \lim_{n\to\infty} -\frac{1}{n^d} \log Q_n(B(X_{C(n)}, D)) = R(\mathbb{P}, \mathbb{Q}, D) \quad \text{w.p.1} \tag{49} \]

where $D_{\min}^{(\infty)}$ and the rate-function $R(\mathbb{P}, \mathbb{Q}, D)$ are defined as in the one-dimensional case, by (17) and (16), respectively, and the rate-functions $R_n(P_n, Q_n, D)$ are now defined as

\[ R_n(P_n, Q_n, D) = \inf_{V_n} \frac{1}{n^d} H(V_n \,\|\, P_n \times Q_n) \]

with the infimum taken over all joint distributions $V_n$ on $A^{n^d} \times \hat{A}^{n^d}$ such that the $A^{n^d}$-marginal of $V_n$ is $P_n$ and $E_{V_n}[\rho_n(X_{C(n)}, Y_{C(n)})] \le D$.

Remark 6: Suppose that (X, Y) is a stationary random field satisfying a "process-level LDP" with a convex, good rate-function. To be precise, given $x_{C(n)} \in A^{n^d}$, write $x^{(n)}$ for the periodic extension of $x_{C(n)}$ to an infinite realization in $A^{\mathbb{Z}^d}$, and let $X^{(n)}$ and $Y^{(n)}$ denote the periodic extensions of $X_{C(n)}$ and $Y_{C(n)}$, respectively. The process-level empirical measure $\mathcal{L}_n$ induced by X and Y on $(A^{[0,\infty)} \times \hat{A}^{[0,\infty)})$ is defined by

\[ \mathcal{L}_n = \frac{1}{n^d} \sum_{u \in C(n)} \delta_{X^{(n)}_{u+[0,\infty)},\, Y^{(n)}_{u+[0,\infty)}} \]

where $\delta_{s,s'}$ denotes the measure assigning unit mass to the joint realization $(s, s') \in A^{[0,\infty)} \times \hat{A}^{[0,\infty)}$, and $X^{(n)}_{u+[0,\infty)}$ (or $Y^{(n)}_{u+[0,\infty)}$) denotes $X^{(n)}$ (respectively, $Y^{(n)}$) shifted by u [i.e., the value of $X^{(n)}_{u+[0,\infty)}$ at position v is the same as the value of $X^{(n)}$ at position $u + v$; similarly for $Y^{(n)}_{u+[0,\infty)}$]. By assuming that (X, Y) satisfy a "process-level LDP" we mean that the sequence of measures $\{\mathcal{L}_n\}$ satisfies the LDP in the space of stationary probability measures on $(A^{[0,\infty)} \times \hat{A}^{[0,\infty)})$ equipped with the topology of weak convergence, with some convex, good rate-function $I(\cdot)$. These assumptions are satisfied by many of the random field models used in applications, and in particular by a large class of Gibbs fields (see, e.g., [?] for general theory and [35], [73] for examples in the areas of image processing and image analysis).






As in the one-dimensional case, suppose that the process-level LDP condition holds, and that the distortion measure $\rho$ is bounded and continuous on $A \times \hat{A}$. Then with $\mathbb{P}$-probability one, conditional on $X_{[0,\infty)} = x_{[0,\infty)}$, the sequence $\{\rho_n(x_{C(n)}, Y_{C(n)})\}$ satisfies the LDP upper bound with respect to the deterministic, convex rate-function $J(\cdot)$ as in Remark 3. Moreover, assuming sufficiently strong mixing properties for Y one may also verify the corresponding lower bound (for example, by adapting the stochastic subadditivity approach of [17]).



6.3 Applications 

In Sections 6.3.1 and 6.3.2 below we consider the random field analogs of the problems discussed in 
Section 3 in the context of one-dimensional processes. In the instances when our analysis was restricted 
to i.i.d. processes, the extension to random fields is trivial - an i.i.d. random field is no different from an 
i.i.d. process. For that reason, we only give the full statements of corresponding random fields results 
when the generalization from $d = 1$ to $d \ge 2$ does involve some modifications. Otherwise, only a brief
description of the corresponding results is mentioned. 



6.3.1 Lossy Data Compression 

Here we very briefly discuss the problem of data compression, when the data is in the form of a two- 
or more generally a d-dimensional array. In this case, the underlying data source is naturally modeled 
as a d-dimensional random field. Extensive discussions of the general information-theoretic problems 
on random fields are given in [?] and the recent monograph [84]; see also [33].



First we discuss the results given in Section 3.1. The construction of the random codebooks 
described there generalizes to random fields in an obvious fashion, and the statement as well as the 
proof of Theorem 9 remain unchanged. Following the notation exactly as developed for i.i.d. sources, 
the strengthened coding theorems given in Theorems 10 and 11 follow by combining (the obvious 
generalization of) Theorem 9 with the generalized AEP of Theorem 26. 

Similarly, the mismatched-codebook results of Section 3.2 only rely on Theorem 9 and the gen- 
eralized AEP of Theorem 1, and therefore immediately generalize to the random field case. Finally 
Theorems 15 and 16 in Section 3.5 are only stated for i.i.d. processes, hence, as mentioned above, they 
trivially extend to random fields. 



6.3.2 Waiting Times 

Here we consider the natural d-dimensional analogs of the waiting times questions considered in Section 3.3. Given two independent realizations of the random fields X and Y, our main quantity of interest here is how "far" we have to look in Y until we find a match for the pattern $X_{C(n)}$ with distortion D or less. Given $n \ge 1$ and a distortion level $D \ge 0$, we define the waiting time $W_n$ as the smallest length i such that a copy of the pattern $X_{C(n)}$ appears somewhere in $Y_{C(i+n-1)}$, with distortion D or less. Formally,

\[ W_n = \inf\{ i \ge 1 : \rho_n(X_{C(n)}, Y_{u+C(n)}) \le D \ \text{ for some } u \in [0, i-1]^d \} \]

with the convention that the infimum of the empty set equals $+\infty$.
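A direct, brute-force Python sketch of this definition for $d = 2$ and Hamming distortion may help fix ideas. It is an illustration only: the fields are finite square arrays given as lists of rows, array entry `[a][b]` stands for lattice point $(a+1, b+1)$, and `None` is returned in place of $+\infty$ when no match exists inside the finite array. The function names are hypothetical.

```python
def rho_block(x, y, u):
    """rho_n between the n-by-n pattern x and the block of y shifted by u = (u1, u2)."""
    n = len(x)
    mismatches = sum(x[a][b] != y[u[0] + a][u[1] + b]
                     for a in range(n) for b in range(n))
    return mismatches / (n * n)

def waiting_time_2d(x, y, D):
    """Smallest i >= 1 such that some shift u in [0, i-1]^2 gives a D-close match."""
    n, N = len(x), len(y)
    for i in range(1, N - n + 2):
        # shifts that are new at stage i are those with max(u1, u2) == i - 1
        for a in range(i):
            for u in ((a, i - 1), (i - 1, a)):
                if rho_block(x, y, u) <= D:
                    return i
    return None

if __name__ == "__main__":
    x = [[1, 0],
         [0, 1]]
    y = [[0, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 1, 0]]
    print(waiting_time_2d(x, y, 0.0), waiting_time_2d(x, y, 0.5))
```

The searched region at stage $i$ contains $i^d$ shifts, which is the volume $(W_n)^d$ that enters the strong approximation result stated next.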

In the one-dimensional case our main tool in investigating the asymptotic behavior of the waiting
times was the strong approximation in Theorem 13. Roughly speaking, Theorem 13 stated that
the waiting time $W_n$ for a $D$-close match of $X_1^n$ in $Y$ is inversely proportional to the probability
$Q_n(B(X_1^n, D))$ of such a match. In Theorem 28 below we generalize this result to the $d$-dimensional
case by showing that the $d$-dimensional volume $(W_n)^d$ we have to search in $Y$ in order to find a $D$-close
match for $X_{C(n)}$ is, roughly, inversely proportional to the probability $Q_n(B(X_{C(n)}, D))$ of finding such
a match.
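To get a feel for the size of the match probability, here is a minimal sketch under strong simplifying assumptions that are not made in the paper: $Y$ is an i.i.d. fair-coin field and the distortion is the average Hamming distance. In that case the number of mismatches over the $n^d$ sites is Binomial($n^d$, 1/2) whatever the pattern is, so the probability is a binomial tail. The name `match_probability` is hypothetical.

```python
from math import comb, floor

def match_probability(n, d, D):
    """Q_n(B(x, D)) for an i.i.d. Bernoulli(1/2) field Y and average Hamming
    distortion: P(Binomial(n^d, 1/2) <= n^d * D), independent of the pattern x."""
    sites = n ** d
    k_max = floor(sites * D + 1e-12)   # small slack guards against float round-off
    return sum(comb(sites, k) for k in range(k_max + 1)) / 2 ** sites
```

The reciprocal `1 / match_probability(n, d, D)` is then the heuristic order of the search volume $(W_n)^d$, up to the polynomial-in-$n$ factors that the strong approximation allows.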

Before stating Theorem 28 we need to recall the following definition. Dobrushin's non-uniform
$\phi$-mixing coefficients of a stationary random field $Y$ are

$$\phi_\ell(k) = \sup\{\, |Q(B \mid A) - Q(B)| : B \in \sigma(Y_U),\ A \in \sigma(Y_V),\ Q(A) > 0,\ |U| \le \ell,\ |V| < \infty,\ d(U,V) \ge k \,\}$$

where $\sigma(Y_U)$ denotes the $\sigma$-field generated by the random variables $Y_U$, $U \subset \mathbb{Z}^d$. See [?, Chapter 6]
or [?] for detailed discussions of the coefficients $\{\phi_\ell(k)\}$ and their properties.

Theorem 28. Strong Approximation: Let $X$ and $Y$ be stationary ergodic random fields, and
assume that the non-uniform $\phi$-mixing coefficients of $Y$ satisfy

$$\limsup_{n\to\infty} \sum_{j=1}^{\infty} (j+1)^{d-1}\, \phi_{n^d}(jn) < \infty. \tag{50}$$

If $Q_n(B(X_{C(n)}, D)) > 0$ eventually with probability one, then for any $\epsilon > 0$:

$$-(1+\epsilon)\log n \ \le\ \log\big[W_n^d\, Q_n(B(X_{C(n)}, D))\big] \ \le\ (d+1+\epsilon)\log n \quad \text{eventually, w.p.1.}$$



The proof of Theorem 28 is a straightforward modification of the corresponding one-dimensional
argument in [23]; it is given in Appendix D.



Remark 7: The mixing condition (50) is satisfied by a rather large class of stationary random
fields. For example, in the case of Markov random fields, it is easy to check that under Dobrushin's
uniqueness condition the limit in (50) is finite; see [34, Section 8.2] or [?] for more details.

Next we combine the above strong approximation result with the generalized AEPs of Theorems 26 
and 27, to read off the first order asymptotic behavior of the waiting times. Theorem 29 below 
generalizes Theorem 14 to the random field case. 

Theorem 29. SLLN for Waiting Times: Let $X$ and $Y$ be stationary ergodic random fields.
(a) If $Y$ is i.i.d. and the average distortion $D_{\rm av}$ is finite, then for any $D \in (D_{\min}, D_{\rm av})$:

$$\frac{1}{n^d} \log W_n^d \to R_1(P_1, Q_1, D) \quad \text{w.p.1.}$$

(b) Suppose that the conditions of Theorem 27 are satisfied, and that $Y$ also satisfies the mixing
assumption (50). Then, for any $D \in (D_{\min}^{(\infty)}, D_{\rm av})$:

$$\frac{1}{n^d} \log W_n^d \to R(\mathbb{P}, \mathbb{Q}, D) \quad \text{w.p.1.}$$

7 Random-Fields: Second Order Results



Finally we turn to the random field extensions of the second order results of Sections 4 and 5. In 
Section 7.1 we state the random field analog of the second order generalized AEP, and in 7.2 we discuss 
its application to the problems of lossy data compression and pattern matching. 



7.1 Refinements of Generalized AEP 

Let $X$ be a stationary ergodic random field with marginal distribution $P$ on $A$, and let $Q$ be a fixed
probability measure on $\hat{A}$. We will assume throughout that the distortion measure $\rho$ has a finite third
moment,

$$D_3 = E_{P \times Q}[\rho^3(X,Y)] < \infty \tag{51}$$

and that it is not essentially constant, i.e., $D_{\min} < D_{\rm av}$, with $D_{\min}$ and $D_{\rm av}$ defined as before.

The goal of this section is to give the random field analogs of Theorems 17 and 18 and of Corol- 
lary 19 from the one-dimensional case. 

An examination of the proof of Theorem 17 in [?] shows that it only depends on the
ergodicity of $X$ and the i.i.d. structure of the product measures $Q^n$. Simply replacing the application
of the ergodic theorem by the ergodic theorem for $\mathbb{Z}^d$ actions [?, Chapter 6] immediately yields the
following generalization: As long as condition (51) is satisfied, for all $D \in (D_{\min}, D_{\rm av})$ we have

$$-\log Q^{n^d}(B(X_{C(n)}, D)) = n^d R_1(\hat{P}_n, Q, D) + \frac{d}{2}\log n + O(1) \quad \text{w.p.1} \tag{52}$$

where $\hat{P}_n$ is now the empirical measure induced by $X_{C(n)}$ on $A$.

In order to generalize Theorem 18 to $\mathbb{Z}^d$ we need to introduce a measure of dependence analogous
to $\alpha$-mixing in the one-dimensional case. For a stationary random field $X$ on $\mathbb{Z}^d$ we define the uniform
$\alpha$-mixing coefficients of $X$ by

$$\alpha(k) = \sup\{\, |\mathbb{P}(A \cap B) - \mathbb{P}(A)\mathbb{P}(B)| : A \in \sigma(X_U),\ B \in \sigma(X_V),\ d(U,V) \ge k \,\}$$

where, as before, $\sigma(X_U)$ denotes the $\sigma$-field generated by the random variables $X_U$. See [?], [23] for
more details.

Apart from ergodicity, the main technical ingredient in the proof of Theorem 18 above (see also 
the proof of |23, Theorem 3]) is the LIL for X. Similarly to the one-dimensional case, the LIL for a 
random field X holds as soon as the following mixing condition is satisfied 



a 



{k) < C k-^'^'~^+^\ for some e > and C < oo. (53) 



[This follows from the almost sure invariance principle in [?, Theorem 1].]

Assuming that (53) and the third moment condition (51) both hold, we get the following
generalization of Theorem 18. For all $D \in (D_{\min}, D_{\rm av})$,

$$n^d R_1(\hat{P}_n, Q, D) = n^d R_1(P, Q, D) + \sum_{u \in C(n)} g(X_u) + O(\log\log n) \quad \text{w.p.1} \tag{54}$$

with $g(x)$ defined exactly as in the one-dimensional case (39).

Combining (52) and (54) gives the following generalization of Corollary 19:

Theorem 30: Second Order Generalized AEP: Let $X$ be a stationary ergodic random field with
marginal distribution $P$ on $A$, and let $Q$ be an arbitrary probability measure on $\hat{A}$. Assume that the
uniform $\alpha$-mixing coefficients of $X$ satisfy (53) and that $D_3 = E_{P \times Q}[\rho^3(X,Y)]$ is finite. Then for any
$D \in (D_{\min}, D_{\rm av})$, and with $g(x)$ defined as in (39),

$$-\log Q^{n^d}(B(X_{C(n)}, D)) = n^d R_1(P, Q, D) + \sum_{u \in C(n)} g(X_u) + \frac{d}{2}\log n + O(\log\log n) \quad \text{w.p.1.}$$

7.2 Applications 

Next we discuss applications of the second order generalized AEP to the d-dimensional analogs of the 
data compression and pattern matching problems of Section 4. As in Section 6.3, the only results 
stated explicitly are those whose extensions to Z'^ require modifications. 

As mentioned in Section 6.3.1, the one-dimensional construction of the random codes, as well as
the main tool used in their analysis, Theorem 9, immediately generalize to the random field case. And
since all the second order results of Section 5.1 (Theorems 20-23) are stated for i.i.d. sources, their
statements as well as proofs carry over verbatim to this case.

For the problem of waiting times, we can use the second order generalized AEP of Theorem 30 to
refine the SLLN of Theorem 29

$$\frac{1}{n^d} \log W_n^d \to R_1(P, Q, D) \quad \text{w.p.1}$$

to a corresponding CLT and LIL as in the one-dimensional case. These refinements are stated in
Theorem 31 below. Its proof is identical to that of Theorem 24 in the one-dimensional case. The only
difference here is that we need to invoke the CLT and LIL for the partial sums of the random field
$\{g(X_u) ; u \in \mathbb{Z}^d\}$. Under the conditions of the theorem, these follow from the almost sure invariance
principle of [?, Theorem 1].

Theorem 31: Let $X$ be a stationary ergodic random field and $Y$ be i.i.d., with marginal distributions
$P$ and $Q$ on $A$ and $\hat{A}$, respectively. Assume that the uniform $\alpha$-mixing coefficients of $X$ satisfy (53) and
that $D_3 = E_{P \times Q}[\rho^3(X,Y)]$ is finite. Then for any $D \in (D_{\min}, D_{\rm av})$ the following series is absolutely
convergent

$$\sigma^2 = \sum_{u \in \mathbb{Z}^d} E_P[g(X_0)\, g(X_u)] \tag{55}$$

with $g(x)$ defined as in (39), and, moreover:
(CLT) With $R_1 = R_1(P,Q,D)$:

$$\frac{\log W_n^d - n^d R_1}{n^{d/2}} \ \xrightarrow{\ \mathcal{D}\ }\ N(0, \sigma^2).$$

(LIL) The set of limit points of the sequence

$$\left\{ \frac{\log W_n^d - n^d R_1}{\sqrt{2 n^d \log\log n}},\quad n \ge 3 \right\}$$

coincides with $[-\sigma, \sigma]$, with probability one.

A Proof of Theorem 7

We prove the upper and lower bounds separately. For the upper bound, recalling the definition of
$r_n(X_1^n, D)$, we observe that

$$r_n(X_1^n, D) \le \frac{1}{n}\log P_n(B(X_1^n, D)) - \frac{1}{n}\log Q^n(X_1^n)$$

where the second term converges to $H(P) + H(P\|Q)$ as $n \to \infty$, by the ergodic theorem. Since the
first term is increasing in $D$, for any fixed $D > 0$ we have with $\mathbb{P}$-probability one:

$$\limsup_{n\to\infty,\ D \downarrow 0} r_n(X_1^n, D) \le H(P) + H(P\|Q) + \limsup_{n\to\infty} \frac{1}{n}\log P_n(B(X_1^n, D)). \tag{56}$$

Now the pointwise source coding theorem (see [51, Theorems 1 and 5]) implies that

$$\liminf_{n\to\infty} -\frac{1}{n}\log P_n(B(X_1^n, D)) \ge R(D) \quad \text{w.p.1} \tag{57}$$

where $R(D)$ is the rate-distortion function of the source $X$ (in nats). From equations (56) and (57)
we get

$$\limsup_{n\to\infty,\ D \downarrow 0} r_n(X_1^n, D) \le H(P) + H(P\|Q) - R(D)$$
$$\le H(P) + H(P\|Q) - H(\mathbb{P}) + H(P) - R_1(D) \quad \text{w.p.1}$$

where $R_1(D)$ denotes the first order rate-distortion function of $X$, $H(\mathbb{P})$ is the entropy rate of $X$
(both in nats), and the second inequality follows from the Wyner-Ziv bound; see [?, Remark 4]. The
assumption that $\rho(x,y) = 0$ if and only if $x = y$ implies that $\lim_{D \downarrow 0} R_1(D) = H(P)$, so letting $D \downarrow 0$
the above right hand side becomes $H(P) + H(P\|Q) - H(\mathbb{P})$, and it is an easy calculation to verify that
this is indeed the same as $H(\mathbb{P}\|\mathbb{Q})$. This gives the required upper bound.
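The "easy calculation" can be sketched as follows, under the assumptions (suggested by the use of the ergodic theorem for the second term above) that the reference measure is the product measure $\mathbb{Q} = Q^\infty$ and that the alphabet is finite, so that all entropies are finite. Here $P_n$ denotes the $n$-dimensional marginal of $\mathbb{P}$ and $P = P_1$.

```latex
\begin{align*}
H(\mathbb{P}\|\mathbb{Q})
  &= \lim_{n\to\infty} \frac{1}{n}\,
     E_{\mathbb{P}}\!\left[\log \frac{P_n(X_1^n)}{Q^n(X_1^n)}\right] \\
  &= \lim_{n\to\infty} \left\{ -\frac{1}{n} H(P_n)
     - \frac{1}{n}\sum_{i=1}^{n} E_P[\log Q(X_i)] \right\} \\
  &= -H(\mathbb{P}) - E_P[\log Q(X_1)] \\
  &= -H(\mathbb{P}) + H(P) + H(P\|Q).
\end{align*}
```

The last step uses $-E_P[\log Q(X_1)] = -\sum_x P(x)\log P(x) + \sum_x P(x)\log\frac{P(x)}{Q(x)} = H(P) + H(P\|Q)$, and the third uses stationarity together with the definition of the entropy rate.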
For the lower bound we proceed similarly by noting that

$$r_n(X_1^n, D) \ge \frac{1}{n}\log P_n(X_1^n) - \frac{1}{n}\log Q^n(B(X_1^n, D)),$$

where the first term converges to $-H(\mathbb{P})$ by the classical AEP (as $n \to \infty$). Since the second term is
decreasing in $D$, for any fixed $D > 0$ small enough we have with probability one:

$$\liminf_{n\to\infty,\ D \downarrow 0} r_n(X_1^n, D) \ge -H(\mathbb{P}) - \limsup_{n\to\infty} \frac{1}{n}\log Q^n(B(X_1^n, D))$$
$$= -H(\mathbb{P}) + R_1(P, Q, D)$$

where the last step follows from the generalized AEP in Theorem 1 (note that $D_{\min} = 0$ here). By
the characterization of the rate-function in Proposition 2 we know that

$$R_1(P,Q,D) = \sup_{\lambda' \le 0}\,[\lambda' D - \Lambda(\lambda')] \ \ge\ [\lambda D - \Lambda(\lambda)] = -E_P\Big[\log E_Q\big(e^{\lambda(\rho(X,Y) - D)}\big)\Big]$$

for any fixed $\lambda < 0$. Therefore, for any $D$ small enough and $\lambda < 0$ we have

$$\liminf_{n\to\infty,\ D \downarrow 0} r_n(X_1^n, D) \ge -H(\mathbb{P}) - E_P\Big[\log E_Q\big(e^{\lambda(\rho(X,Y) - D)}\big)\Big] \quad \text{w.p.1.}$$

Letting $D \downarrow 0$ and then $\lambda \to -\infty$, by the dominated convergence theorem (and the assumption
$\rho(x,y) = 0$ iff $x = y$) the right hand side above converges to $-H(\mathbb{P}) + H(P\|Q) + H(P) = H(\mathbb{P}\|\mathbb{Q})$,
proving the lower bound.

Finally, since for each fixed $n$ the limit as $D \downarrow 0$ of $r_n(X_1^n, D)$ exists, it follows that the repeated
limit $\lim_{n} \lim_{D}$ also exists and is equal to the double limit $H(\mathbb{P}\|\mathbb{Q})$. □
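The variational formula from Proposition 2 that drives the lower bound, $R_1(P,Q,D) = \sup_{\lambda \le 0}[\lambda D - \Lambda(\lambda)]$ with $\Lambda(\lambda) = E_P[\log E_Q e^{\lambda\rho(X,Y)}]$, is easy to evaluate numerically on finite alphabets. The following is only a sketch: the function name `rate_function`, the search interval `lam_min`, and the use of ternary search (valid because the objective is concave in $\lambda$) are illustrative choices, not taken from the paper.

```python
import numpy as np

def rate_function(P, Q, rho, D, lam_min=-200.0, iters=200):
    """Approximate R_1(P, Q, D) = sup_{lam <= 0} [lam * D - Lambda(lam)] in nats.
    P, Q: probability vectors; rho: distortion matrix of shape (len(P), len(Q)).
    Assumes the supremum is attained inside [lam_min, 0]."""
    P, Q, rho = (np.asarray(a, dtype=float) for a in (P, Q, rho))

    def objective(lam):
        # Lambda(lam) = E_P[ log E_Q exp(lam * rho(X, Y)) ]
        inner = np.exp(lam * rho) @ Q      # one value per source symbol x
        return lam * D - P @ np.log(inner)

    lo, hi = lam_min, 0.0
    for _ in range(iters):                 # ternary search on a concave function
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return float(objective((lo + hi) / 2.0))
```

As a sanity check, for the binary alphabet with Hamming distortion and $P = Q$ uniform, the supremum can be computed by hand and equals $\log 2 - h(D)$ for $D < 1/2$, where $h$ is the binary entropy in nats; the sketch reproduces this value.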



B Proof of Theorem 8 

Part (a): Fixing $n$, let $f_n = dP_n/dQ_n$ and consider the set

A„ ^ : Q„(B(.f , «)) > V« > 0, /„(.f , . .i.sup MM^ . ...yf |(|(|^ } . 

By the Radon-Nikodym theorem (cf. [?, Theorems 1.6.1, 1.6.2]), we know that $Q_n(A_n) = 1$, hence
also Pn{An) = 1. With ¥{DnA^) = 0, we conclude the proof of part (a) by applying Theorem 6 for 
Mn = (in which case Hn>0). 

Part (b): As $Q(A_1) = 1$, in particular $Q(B(x,D)) > 0$ for all $D > 0$ and $Q$-almost every $x \in \mathbb{R}^d$
(hence also for $P = P_1$-almost every $x \in \mathbb{R}^d$), implying that $D_{\min}$ is zero. The same argument
yields also that $P(B(x,D)) > 0$ for all $D > 0$ and $P$-almost every $x$, hence $D_{\min}$ is still zero if we
replace $Q$ by $P$. Thus, for all $D < \min\{E_{P \times Q}[\rho(X,Y)],\ E_{P \times P}[\rho(X,Y)]\}$, applying Theorem 1 twice
we get

$$\lim_{n\to\infty} r_n(X_1^n, D) = R_1(P,Q,D) - R_1(P,P,D) \quad \text{w.p.1.}$$



For any probability measure $\mu$ and any $\lambda < 0$, let

$$\Lambda(\lambda; \mu) = \int \log\left[\int e^{\lambda\rho(x,y)}\, d\mu(y)\right] dP(x).$$



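Fixing $D > 0$ small enough, we have by Proposition 2 that $R_1(P,P,D) = \lambda D - \Lambda(\lambda;P)$ for
the unique $\lambda = \lambda(D) < 0$ such that $\Lambda'(\lambda;P) = D$, whereas $R_1(P,Q,D) \ge \lambda D - \Lambda(\lambda;Q)$. Since
$E_{P \times P}[\rho(X,Y)] > 0$, we have also that $\lambda(D) \downarrow -\infty$ as $D \downarrow 0$. Consequently,

$$\liminf_{D \downarrow 0}\{R_1(P,Q,D) - R_1(P,P,D)\} \ \ge\ \liminf_{\lambda \downarrow -\infty}\{\Lambda(\lambda;P) - \Lambda(\lambda;Q)\}.$$

Similarly, by Proposition 2 we have $R_1(P,Q,D) = \lambda D - \Lambda(\lambda;Q)$ for $\lambda < 0$ such that $\Lambda'(\lambda;Q) = D$,
$R_1(P,P,D) \ge \lambda D - \Lambda(\lambda;P)$, and with $E_{P \times Q}[\rho(X,Y)] > 0$, also $\lambda \downarrow -\infty$ when $D \downarrow 0$. Therefore, it
suffices to show that

$$\lim_{\lambda \downarrow -\infty}\{\Lambda(\lambda;P) - \Lambda(\lambda;Q)\} = H(P\|Q). \tag{58}$$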

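To this end, for any $\lambda < 0$ and $x \in \mathbb{R}^d$, let

$$h_\lambda(x) \triangleq \frac{E_P\big(e^{\lambda\rho(x,Y)}\big)}{E_Q\big(e^{\lambda\rho(x,Y)}\big)}$$

noting that

$$\Lambda(\lambda;P) - \Lambda(\lambda;Q) = \int \log h_\lambda(x)\, dP(x).$$

Using the change of variable $u = \rho(x,Y) \ge 0$ followed by integration by parts, we see that

$$h_\lambda(x) = \frac{\int_0^\infty e^{\lambda u}\, g_x(u)\, du}{\int_0^\infty e^{\lambda u}\, k_x(u)\, du}$$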

where $g_x(r) = P(B(x,r))$ and $k_x(r) = Q(B(x,r))$ are nonnegative, nondecreasing and bounded above
by 1. Considering separately $u \le 2\eta$ and $u > 2\eta$, it is easy to check that for any $\eta > 0$,

$$\sup_{0 < r \le 2\eta} \frac{g_x(r)}{k_x(r)} + \psi_{\lambda,x} \ \ge\ h_\lambda(x) \ \ge\ \inf_{0 < r \le 2\eta} \frac{g_x(r)}{k_x(r)} \cdot \frac{1}{1 + \psi_{\lambda,x}} \tag{59}$$



where 



Fix $x \in A_1$ of part (a), in which case $k_x(r) > 0$ for all $r > 0$ and $g_x(r)/k_x(r) \to f_1(x)$ as $r \to 0$.
Letting $\lambda \downarrow -\infty$ and then $\eta \to 0$, it follows by (59) and (60) that

$$\lim_{\lambda \downarrow -\infty} h_\lambda(x) = f_1(x).$$

Recall that $P(A_1) = 1$ and our assumption that $\int \log k_x(\eta)\, dP(x) > -\infty$ for any $\eta > 0$. By our
integrability conditions, the function $\min\{0, \inf_{\lambda \le -1} \log h_\lambda(x)\}$ is $P$-integrable, hence, by Fatou's lemma,

$$\liminf_{\lambda \downarrow -\infty} \int \log h_\lambda(x)\, dP(x) \ \ge\ \int \log f_1(x)\, dP(x) = H(P\|Q).$$

Moreover, in case $H(P\|Q) < \infty$, our assumptions imply that $\sup_{\lambda \le -1} |\log h_\lambda(x)|$ is $P$-integrable, hence
by dominated convergence, $\int \log h_\lambda(x)\, dP(x) \to \int \log f_1(x)\, dP(x)$ for $\lambda \downarrow -\infty$, as required to complete
the proof of (58). □
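The limit (58) can be checked numerically in a finite-alphabet analog, which is not the $\mathbb{R}^d$ setting of the theorem but has the same structure. With Hamming distortion, $E_\mu[e^{\lambda\rho(x,Y)}] = \mu(x) + e^{\lambda}(1 - \mu(x))$, so $h_\lambda(x)$ tends to $P(x)/Q(x)$ as $\lambda \to -\infty$. The function names below are hypothetical, and both distributions are assumed to have full support.

```python
import numpy as np

def log_mgf_gap(P, Q, lam):
    """Lambda(lam; P) - Lambda(lam; Q) for Hamming distortion on a finite
    alphabet, i.e. the P-average of log h_lam(x)."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    # E_mu[exp(lam * 1{Y != x})] = mu(x) + exp(lam) * (1 - mu(x))
    num = P + np.exp(lam) * (1.0 - P)
    den = Q + np.exp(lam) * (1.0 - Q)
    return float(P @ np.log(num / den))

def relative_entropy(P, Q):
    """H(P||Q) in nats, assuming P and Q have full support."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    return float(P @ np.log(P / Q))
```

Evaluating `log_mgf_gap(P, Q, lam)` for increasingly negative `lam` shows the gap rising from 0 at `lam = 0` towards `relative_entropy(P, Q)`.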



C Proof of Theorem 27 



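Recall our assumption that, for $\mathbb{P}$-a.e. $x_{[0,\infty)^d}$, conditional on $X_{[0,\infty)^d} = x_{[0,\infty)^d}$ the random variables
$\{\rho_n(x_{C(n)}, Y_{C(n)})\}$ satisfy the LDP with a deterministic convex good rate-function denoted hereafter
$R(\mathbb{P}, \mathbb{Q}, \cdot)$. Since $\rho$ is bounded, by Varadhan's lemma and convex duality, this implies that

$$R(\mathbb{P}, \mathbb{Q}, D) = \sup_{\lambda \in \mathbb{R}}\,[\lambda D - \Lambda_\infty(\lambda)] \triangleq \Lambda_\infty^*(D) \tag{61}$$

where for any $\lambda \in \mathbb{R}$, the finite, deterministic limit

$$\Lambda_\infty(\lambda) \triangleq \lim_{n\to\infty} \frac{1}{n^d} \log E_{Q_n}\Big(e^{\lambda n^d \rho_n(x_{C(n)}, Y_{C(n)})}\Big)$$

exists for $\mathbb{P}$-a.e. $x_{[0,\infty)^d}$ (cf. [?, Theorem 4.5.10]). By bounded convergence, $\Lambda_\infty(\lambda)$ is also the limit of

$$\Lambda_n(\lambda) \triangleq \frac{1}{n^d} \int \log E_{Q_n}\Big(e^{\lambda n^d \rho_n(x_{C(n)}, Y_{C(n)})}\Big)\, dP_n(x_{C(n)}).$$

By stationarity,

$$D_{\rm av} = E_{P_n \times Q_n}\big(\rho_n(X_{C(n)}, Y_{C(n)})\big), \quad \forall n \ge 1 \tag{62}$$

so replacing $P_1$, $Q_1$ and $\rho(x,y)$ of Proposition 2 by $P_n$, $Q_n$ and $n^d \rho_n(x_{C(n)}, y_{C(n)})$, respectively, we see
that

$$R_n(P_n, Q_n, D) = \sup_{\lambda \in \mathbb{R}}\,[\lambda D - \Lambda_n(\lambda)] \triangleq \Lambda_n^*(D). \tag{63}$$

Note that $|\Lambda_n(\lambda) - \Lambda_n(\lambda')| \le c\,|\lambda - \lambda'|$ for some $c < \infty$ and all $n$, $\lambda, \lambda' \in \mathbb{R}$, hence the convergence of
$\Lambda_n(\cdot)$ to $\Lambda_\infty(\cdot)$ is uniform on compact subsets of $\mathbb{R}$. In particular, the convex, continuous functions
$\Lambda_n(\cdot)$ converge infimally to $\Lambda_\infty(\cdot)$, and consequently, by [?, Theorem 5], the convex functions $\Lambda_n^*(\cdot)$
converge infimally to $\Lambda_\infty^*(\cdot)$, that is

$$\Lambda_\infty^*(D) = \lim_{\delta \to 0} \limsup_{n\to\infty} \inf_{|\hat{D} - D| < \delta} \Lambda_n^*(\hat{D}) = \lim_{\delta \to 0} \liminf_{n\to\infty} \inf_{|\hat{D} - D| < \delta} \Lambda_n^*(\hat{D}). \tag{64}$$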

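It follows from (62) and Jensen's inequality that $\Lambda_n(\lambda) \ge \lambda D_{\rm av}$ for all $n$ and $\lambda$, hence, for $D \le D_{\rm av}$ it
suffices to consider $\lambda \le 0$ in (61) and in (63). Thus, for $1 \le n \le \infty$, $\Lambda_n^*$ are non-negative, convex, and
monotone non-increasing on $[0, D_{\rm av}]$, with $\Lambda_n^*(D_{\rm av}) = 0$. For $1 \le n \le \infty$, let

$$D_{\min}^{(n)} \triangleq \lim_{\lambda \to -\infty} \frac{\Lambda_n(\lambda)}{\lambda}$$

so that $\Lambda_n^*(D) = \infty$ for $D < D_{\min}^{(n)}$ while $\Lambda_n^*(D) < \infty$ for $D > D_{\min}^{(n)}$. Note that for $n < \infty$ this coincides
with the definition of $D_{\min}^{(n)}$ given in (48). It is easy to check then that (64) implies the pointwise
convergence of $\Lambda_n^*(\cdot) = R_n(P_n, Q_n, \cdot)$ to $\Lambda_\infty^*(\cdot) = R(\mathbb{P}, \mathbb{Q}, \cdot)$ at any $D$ for which $\Lambda_\infty^*(D - \delta) \downarrow \Lambda_\infty^*(D)$,
that is, for all $D \ne D_{\min}^{(\infty)}$. In particular, necessarily $D_{\min}^{(\infty)} \in [D_{\min}, D_{\rm av}]$, and $D_{\min}^{(\infty)}$ may also be defined
via (?). The continuity of $R(\mathbb{P}, \mathbb{Q}, D)$ at $D \in (D_{\min}, D_{\rm av})$, $D \ne D_{\min}^{(\infty)}$ implies the equality in (?) for
such $D$, thus completing the proof of the theorem. □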
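The role of the slope of $\Lambda$ at $-\infty$ as the threshold below which the Legendre transform is infinite can be illustrated for a single site on finite alphabets. There the standard fact $\lim_{\lambda \to -\infty} \lambda^{-1} \log E_Q[e^{\lambda\rho(x,Y)}] = \min\{\rho(x,y) : Q(y) > 0\}$ gives $\lim_{\lambda\to-\infty} \Lambda(\lambda)/\lambda = E_P[\min_y \rho(X,y)]$. The sketch below is illustrative only; the function names are hypothetical and the finite-alphabet setting is an assumption.

```python
import numpy as np

def log_mgf(P, Q, rho, lam):
    """Lambda(lam) = E_P[ log E_Q exp(lam * rho(X, Y)) ] on finite alphabets."""
    P, Q, rho = (np.asarray(a, dtype=float) for a in (P, Q, rho))
    return float(P @ np.log(np.exp(lam * rho) @ Q))

def d_min(P, Q, rho):
    """E_P[ min over y with Q(y) > 0 of rho(X, y) ]: the smallest average
    distortion reachable with reproduction symbols drawn from Q."""
    P, Q, rho = (np.asarray(a, dtype=float) for a in (P, Q, rho))
    return float(P @ rho[:, Q > 0].min(axis=1))
```

Comparing `log_mgf(P, Q, rho, lam) / lam` at a large negative `lam` with `d_min(P, Q, rho)` shows the two agreeing up to an error of order $1/|\lambda|$.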

D Proof of Theorem 28 

For each $m \ge 1$, let $G_m$ be the collection of "good" realizations $x_{\mathbb{Z}^d} \in A^{\mathbb{Z}^d}$,

$$G_m = \{\, x_{\mathbb{Z}^d} \in A^{\mathbb{Z}^d} : Q_n(B(x_{C(n)}, D)) > 0 \ \text{ for all } n \ge m \,\}$$

so that the assumption that $Q_n(B(X_{C(n)}, D)) > 0$ eventually, with probability one, translates to

$$\mathbb{P}\{\cup_{m \ge 1} G_m\} = 1. \tag{65}$$

To prove the lower bound we choose and fix an $m \ge 1$ and a realization $x_{\mathbb{Z}^d} \in G_m$. Then for any
$K \ge 1$:

$$\Pr\{W_n^d \le K \mid X_{C(n)} = x_{C(n)}\} \ \le \sum_{u \in [0,\, K^{1/d} - 1]^d} Q_n\{Y_{u+C(n)} \in B(x_{C(n)}, D)\} \ \le\ K\, Q_n(B(x_{C(n)}, D)).$$

Since, by its definition, $W_n$ is always greater than or equal to one, this inequality trivially holds also
for $K \in (0,1]$. Setting $K = [n^{1+\epsilon}\, Q_n(B(x_{C(n)}, D))]^{-1}$ above gives, for all $n \ge m$,

$$\Pr\{\log[W_n^d\, Q_n(B(X_{C(n)}, D))] \le -(1+\epsilon)\log n \mid X_{C(n)} = x_{C(n)}\} \ \le\ n^{-(1+\epsilon)}.$$

Since this bound is uniform over $x_{\mathbb{Z}^d} \in G_m$ and summable, the Borel-Cantelli lemma and assumption
(65) imply that

$$\log[W_n^d\, Q_n(B(X_{C(n)}, D))] \ \ge\ -(1+\epsilon)\log n \quad \text{eventually, w.p.1.} \tag{66}$$

For the upper bound we choose and fix an $m \ge 1$ and a realization $x_{\mathbb{Z}^d} \in G_m$, and take $K \ge (n+1)^d$.
Note that

$$\Pr\{W_n^d > K \mid X_{C(n)} = x_{C(n)}\} \ \le\ \Pr\Big\{ \sum_{u \in [0,M]^d} I_{\{Y_{nu+C(n)} \in B(x_{C(n)}, D)\}} = 0 \Big\}$$

where the sum is over the $(M+1)^d$ integer positions $u \in [0,M]^d \subset \mathbb{Z}^d$, $nu$ denotes the point
$(nu_1, nu_2, \ldots, nu_d) \in \mathbb{Z}^d$, and



A 



n 



M = M{K, n] 
Let $S_n$ denote the sum in the above probability,

$$S_n = \sum_{u \in [0,M]^d} I_n(u)$$

where $I_n(u)$ is the indicator function of the event $\{Y_{nu+C(n)} \in B(x_{C(n)}, D)\}$. In this notation,
Chebyshev's inequality gives

$$\Pr\{W_n^d > K \mid X_{C(n)} = x_{C(n)}\} \ \le\ Q\{S_n = 0\} \ \le\ \frac{\mathrm{Var}_Q(S_n)}{(E_Q S_n)^2}. \tag{67}$$

By stationarity

$$E_Q(S_n) = [M+1]^d\, Q_n(B(x_{C(n)}, D)) \tag{68}$$

and by the definition of the $\phi$-mixing coefficients, for $u \ne v$,

$$E_Q\{I_n(u) I_n(v)\} \ \le\ Q_n(B(x_{C(n)}, D))\,\big[\phi_{n^d}(n\,d(u,v) - n + 1) + Q_n(B(x_{C(n)}, D))\big].$$

Using the last two estimates we can bound the variance as

$$\mathrm{Var}_Q\{S_n\} = \sum_{u,v \in [0,M]^d} \mathrm{Cov}_Q(I_n(u), I_n(v))$$
$$\le [M+1]^d\, Q_n(B(x_{C(n)}, D)) + \sum_{u,v \in [0,M]^d,\ u \ne v} Q_n(B(x_{C(n)}, D))\, \phi_{n^d}(n\,d(u,v) - n + 1)$$
$$\le [M+1]^d\, Q_n(B(x_{C(n)}, D)) \Big[ 1 + \sum_{j=1}^{M} C_d\, j^{d-1}\, \phi_{n^d}(nj - n + 1) \Big] \tag{69}$$

where $C_d\, j^{d-1}$ bounds the number of possible points $u$ that can be at a distance exactly $j$ from a given
point $v$ (for some constant $C_d$). By assumption (50) we can find a finite constant $\Phi$ such that the
expression in square brackets in (69) is bounded above by $\Phi$, uniformly in $n$. Substituting this bound,
together with (68) and (69), in (67), gives

$$\Pr\{W_n^d > K \mid X_{C(n)} = x_{C(n)}\} \ \le\ \frac{\Phi}{[M+1]^d\, Q_n(B(x_{C(n)}, D))}. \tag{70}$$
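The counting constant that enters the variance bound can be made explicit in a small sketch, assuming (this is an assumption, since the metric on $\mathbb{Z}^d$ is not restated here) that distance is measured in the max-norm. The number of lattice points at distance exactly $j$ from a given point is then $(2j+1)^d - (2j-1)^d$, which by the mean value theorem is at most $2d\,3^{d-1} j^{d-1}$ for $j \ge 1$. The function names are hypothetical.

```python
import itertools

def shell_size(j, d):
    """Number of points u in Z^d whose max-norm distance from the origin is
    exactly j (closed form, valid for j >= 1)."""
    return (2 * j + 1) ** d - (2 * j - 1) ** d

def shell_size_brute(j, d):
    """Same count by direct enumeration, used only as a cross-check."""
    coords = range(-j, j + 1)
    return sum(1 for u in itertools.product(coords, repeat=d)
               if max(abs(c) for c in u) == j)
```

With this choice one admissible value of the constant is $C_d = 2d\,3^{d-1}$; any other norm on $\mathbb{Z}^d$ changes the constant but not the $j^{d-1}$ growth.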

Let e > arbitrary, take n large enough so that 71^^+'^^'^ > 2, and let K = n'^~^^~^'^/Qn{B{xc(n)^D)). 
Simple algebra shows that with this choice of K we have 

[M + l]'^Q„(i3(xc(„),D))> 

and substituting this in ( [70| ) yields 

2^> 



Pr{log[T<Q„(5(xc(„),I)))] > id + l + e)logn\Xc(n)=xc{n)} 



< 



This bound is uniform over $x_{\mathbb{Z}^d} \in G_m$ and summable, so the Borel-Cantelli lemma and (65) imply
that

$$\log[W_n^d\, Q_n(B(X_{C(n)}, D))] \ \le\ (d+1+\epsilon)\log n \quad \text{eventually, w.p.1.} \tag{71}$$

Combining (66) and (71) completes the proof. □